Things, not Strings:
From Entity Extraction to Entity Resolution
David Murgatroyd
VP, Engineering
Basis Technology
Basis Technology – Human Language Technology Conference 2012   1
Motivation

Your job is to analyze reciprocal antagonism
between Christian and Islamic extremists across the
globe.
You want to find information on the Internet on
Christian extremist reaction to the killing of the U.S.
Ambassador to Libya.




Help?


That was a lot of work.

Can text analytics help?




Filter?

Filter out pages with the wrong guy?


Filter Example
Filter?

Add some filters (a/k/a facets)…


Filter results by…
People
    <choice 1>
    <choice 2>
    <choice 3>
    …

Filter?

But what can we use as choices?

Filter results by…
People
    <choice 1>
    <choice 2>
    <choice 3>
    …

Entity Extraction (Name Tagging)

Find the names of persons, places, and organizations in a document.

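A toy sketch of name tagging, to make the idea concrete. It is not how any production extractor works: the gazetteer entries, the title list, and the function name `tag_names` are all assumptions made up for this illustration, and real taggers typically rely on statistical models rather than two hand-written rules.

```python
import re

# Hypothetical mini-gazetteer: known strings mapped to an entity type.
GAZETTEER = {
    "Libya": "LOCATION",
    "Basis Technology": "ORGANIZATION",
}

# Hypothetical list of titles that usually precede a person's name.
PERSON_TITLES = ("Ambassador", "Pastor", "Mr.", "Ms.")

# One name token: a capitalized word ("Stevens") or an initial ("J.").
TOKEN = r"(?:[A-Z][a-z]+|[A-Z]\.)"


def tag_names(text):
    """Return sorted (start, end, type, text) tuples for names found in text."""
    mentions = []

    # Rule 1: exact gazetteer lookups.
    for name, entity_type in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            mentions.append((match.start(), match.end(), entity_type, match.group()))

    # Rule 2: a title followed by name tokens is tagged as a PERSON.
    titles = "|".join(re.escape(title) for title in PERSON_TITLES)
    person = re.compile(r"(?:%s)\s+(%s(?:\s+%s)*)" % (titles, TOKEN, TOKEN))
    for match in person.finditer(text):
        mentions.append((match.start(1), match.end(1), "PERSON", match.group(1)))

    return sorted(mentions)


if __name__ == "__main__":
    for mention in tag_names("Ambassador J. Christopher Stevens served in Libya."):
        print(mention)
```

The output is a list of character spans with a type, which is the raw material the later steps (grouping and resolution) work from.
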
In-document Coreference Resolution

Group names referring to the same person, within a document.




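A minimal sketch of in-document grouping, assuming a deliberately naive rule: a shorter mention joins a group when all of its words appear in the group's longest mention. The function name `group_mentions` and the rule itself are assumptions for illustration; the rule misses nicknames ("Chris") and pronouns, which real coreference systems have to handle.

```python
def group_mentions(mentions):
    """Group person mentions taken from ONE document.

    Mentions are visited longest first, so the first member of each
    group is its longest (most complete) form.
    """
    groups = []
    for mention in sorted(mentions, key=len, reverse=True):
        tokens = set(mention.lower().split())
        for group in groups:
            head_tokens = set(group[0].lower().split())
            if tokens <= head_tokens:  # every word also occurs in the head
                group.append(mention)
                break
        else:
            groups.append([mention])  # no group matched: start a new one
    return groups


if __name__ == "__main__":
    doc_mentions = ["J. Christopher Stevens", "Stevens", "Dan Cathy",
                    "Cathy", "Christopher Stevens"]
    for group in group_mentions(doc_mentions):
        print(group)
```
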
Filter choices?

But what can we use as choices?

Filter results by…
People
    <choice 1>
    <choice 2>
    <choice 3>
    …

Filter choices?

Choices: first way that each person was mentioned in each document?

Filter results by…
Persons named
    Kris Stephens
    Chris Stephens
    Dan Cathy
    George Little
    …

Filter?

Choices: first name string for each person in each document?

Filtered by…
Persons named
    Chris Stephens

Add filters…
Persons named
    Dan Cathy
    George Little
    …

Filter?

Problem: Ambiguity – one name, many entities

Filtered by…
Persons named
    Chris Stephens

Add filters…
Persons named
    Dan Cathy
    George Little
    …

Filter?

Problem: Variety – one person, many names

Filtered by…
Persons named
    Chris Stephens

Add filters…
Persons named
    Dan Cathy
    George Little
    …
    Chris Stevens
    J. Christopher Stevens
    …

Where does your favorite data set fall?




[Chart axes: Variety, Ambiguity (tick mark at 1). Legend: # of documents (Thousands, Millions, Billions).]

Deal with ambiguity and variety?

Magically group names by person across documents.

Filter results by…
People
    <choice 1>
    <choice 2>
    <choice 3>
    …

Labels for choices?

But there’s still the problem of choices…

Filter results by…
People
    <choice 1>
    <choice 2>
    <choice 3>
    …

Labels for choices?

Use person’s name from highest ranked doc? Still some ambiguity.

Filter results by…
People
    Kris Stephens
    Chris Stephens 1
    Chris Stephens 2
    …

Labels for choices?

Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia).

Filter results by…
People
    Kris Stephens
    Chris Stephens 1 → J. Christopher Stevens
    Chris Stephens 2 → Chris Stephens
    …

Labels for choices?

For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page).

Filter results by…
People
    Kris Stephens (pastor)
    J. Christopher Stevens
    Chris Stephens (pastor)
    …

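A rough sketch of the two labeling cases above: link a group of names to a known entity when possible, otherwise infer a label. Everything here is hypothetical: the `KNOWLEDGE_BASE` dictionary stands in for a database such as Wikipedia, `resolve` is a made-up function name, and the `descriptor` (e.g. "pastor") is assumed to be supplied by some upstream step. Matching on aliases alone does not solve ambiguity; a real resolver would also weigh the surrounding context.

```python
# Hypothetical knowledge base: entity label -> lowercase aliases.
KNOWLEDGE_BASE = {
    "J. Christopher Stevens": {
        "j. christopher stevens",
        "christopher stevens",
        "chris stevens",
    },
}


def resolve(group_names, descriptor=None):
    """Return a display label for names already grouped as one person.

    group_names: list of name strings for the same person.
    descriptor:  optional context word used when the person is not in
                 the knowledge base (assumed to come from elsewhere).
    """
    normalized = {name.lower() for name in group_names}

    # Case 1: some name is a known alias, so link to the known entity.
    for entity_label, aliases in KNOWLEDGE_BASE.items():
        if normalized & aliases:
            return entity_label

    # Case 2: unknown person, so build a label from the longest name.
    base = max(group_names, key=len)
    return "%s (%s)" % (base, descriptor) if descriptor else base


if __name__ == "__main__":
    print(resolve(["Chris Stevens", "Stevens"]))
    print(resolve(["Kris Stephens"], descriptor="pastor"))
```
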
Filter.

Let’s give it a try…

Filter results by…
People
    Kris Stephens (pastor)
    J. Christopher Stevens
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

Filter.

Let’s give it a try…

Filtered by…
People
    J. Christopher Stevens

Add filters…
People
    Kris Stephens (pastor)
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

Does it work?

How do you measure?




How do you measure?

Imagine this was the result of applying the filter with the name from Wikipedia.

Filtered by…
People
    J. Christopher Stevens

Add filters…
People
    Kris Stephens (pastor)
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

How do you measure?

Precision: for each document, how much of the stuff grouped with it is correct?

Filtered by…
People
    J. Christopher Stevens

Results:
    ✗  1 / 3 = 33%
    ✓  2 / 3 = 67%
    ✓  2 / 3 = 67%

Add filters…
People
    Kris Stephens (pastor)
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

How do you measure?

Recall: for each document, how much of the correct stuff is grouped with it?

Filtered by…
People
    J. Christopher Stevens

Results:
    ✓  2 / 5 = 40%
    ✓  2 / 5 = 40%
    ✗
    ✗
    ✗

Add filters…
People
    Kris Stephens (pastor)
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

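The per-document numbers on the precision and recall slides can be reproduced with a short sketch. The document ids (`d1` to `d6`), the group labels, and the function name are assumptions for illustration. In particular, the slides do not say where the three missed documents ended up, so this example places each of them in a group of its own. This per-document style of scoring is commonly known as B-cubed; the slides themselves do not name it.

```python
def per_document_scores(predicted, gold):
    """Return {doc: (precision, recall)}.

    predicted: doc id -> group the system put the document in.
    gold:      doc id -> group the document really belongs to.
    """
    scores = {}
    for doc in predicted:
        grouped_with = {d for d, g in predicted.items() if g == predicted[doc]}
        belongs_with = {d for d, g in gold.items() if g == gold[doc]}
        correct = len(grouped_with & belongs_with)
        scores[doc] = (correct / len(grouped_with), correct / len(belongs_with))
    return scores


# Five documents are really about the ambassador; d1 is about someone else.
gold = {"d1": "someone else", "d2": "stevens", "d3": "stevens",
        "d4": "stevens", "d5": "stevens", "d6": "stevens"}

# The filter returned d1, d2, d3. ASSUMPTION: each missed document
# (d4, d5, d6) landed in a separate group of its own.
predicted = {"d1": "stevens", "d2": "stevens", "d3": "stevens",
             "d4": "g4", "d5": "g5", "d6": "g6"}

if __name__ == "__main__":
    scores = per_document_scores(predicted, gold)
    print("d1 precision:", scores["d1"][0])  # 1 / 3
    print("d2 precision:", scores["d2"][0])  # 2 / 3
    print("d2 recall:   ", scores["d2"][1])  # 2 / 5
```

Averaging the per-document values gives one precision and one recall figure for the whole collection.
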
Does it work?

We often combine Precision and Recall measurements into a single measurement, called “F”.

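The slides do not say which variant of F is meant. A common choice, assumed here, is the weighted harmonic mean of precision and recall, which reduces to the familiar F1 when `beta` is 1. The function name `f_measure` is made up for this sketch.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision 2/3 and recall 2/5, as in the per-document example.
    print(round(f_measure(2 / 3, 2 / 5), 3))
```

Because it is a harmonic mean, F stays low unless both precision and recall are reasonably high.
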
Where does your favorite data lie?
[Chart axes: Variety, Ambiguity (tick mark at 1). Legend: # of documents (Thousands, Millions, Billions).
Corpora plotted: ACE 2005, WEPS-2, TAC pre-2012, TAC eng 2012, TAC zho 2012, TAC spa 2012, Basis Balanced, Basis Ambig, Basis Variance 1, Basis Variance 2.
Region labels: F>=?, F>=70, F>=85.]

Trading off Errors

Let’s pretend you’re researching the pastors instead.

Filter results by…
People
    Kris Stephens (pastor)
    J. Christopher Stevens
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

Trading off Errors

What if you think there are too many (or too few)? Add a slider for making the filter finer (or coarser).

Filtered by…
People
    Kris Stephens (pastor)

Add filters…
People
    J. Christopher Stevens
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

Trading off Errors

Make the filter finer.

Filtered by…
People
    Kris Stephens (pastor)

Add filters…
People
    J. Christopher Stevens
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

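The slides do not describe how the slider works. One plausible reading, sketched here purely as an assumption, is a similarity threshold: names join a group only if they are at least that similar to a member, so a higher threshold gives finer groups and a lower one gives coarser groups. The string similarity from Python's `difflib` is a stand-in; a real system would compare far more than spelling.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Spelling similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_names(names, threshold):
    """Greedily group names; `threshold` plays the role of the slider."""
    groups = []
    for name in names:
        for group in groups:
            if any(similarity(name, member) >= threshold for member in group):
                group.append(name)
                break
        else:
            groups.append([name])  # not similar enough to any group
    return groups


NAMES = ["Kris Stephens", "Chris Stephens", "Chris Stevens",
         "J. Christopher Stevens", "Dan Cathy"]

if __name__ == "__main__":
    for threshold in (0.0, 0.8, 1.0):
        print(threshold, group_names(NAMES, threshold))
```

At 0.0 everything collapses into one group and at 1.0 only identical names share a group; the useful settings lie in between, which is the trade-off between the two kinds of error.
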
Demo
Questions


•  Suggested questions:
   –  Doesn’t Google already do this?
   –  Speed? Scale?
   –  Multi-lingual?
   –  What other uses are there for entity resolution
      beyond faceted search?




Thank you!



For more information:
Visit www.basistech.com
Write to conference@basistech.com
Call 617-386-2090

Doesn’t Google already do this?

Some, when searching for famous entities.

Speed/Scale


•  Support from BRAVE for scale in CY13!
•  Research version:
   –  Tested up to 1M docs
   –  Sub-second per document
   –  Incremental updates (i.e., you see documents
      published minutes ago)




Other uses for entity resolution?
•  Supporting relationship resolution by resolving the
   entities that participate in the relationships.
•  Knowledge base population
•  Integrating disparate data sets
•  Alerting
•  Improving relevance of search results
•  Predictive Analytics




Multilingual Search and Text Analytics with Solr - Open Source Search ConferenceMultilingual Search and Text Analytics with Solr - Open Source Search Conference
Multilingual Search and Text Analytics with Solr - Open Source Search Conference
 


Entity Resolution and Text Analytics Help Analyze Extremist Reactions

  • 1. Things, not Strings: From Entity Extraction to Entity Resolution. David Murgatroyd, VP, Engineering, Basis Technology. Basis Technology – Human Language Technology Conference 2012
  • 2. Motivation: Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe. You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya.
  • 14. Help? That was a lot of work. Can text analytics help?
  • 15. Filter? Filter out pages with the wrong guy? [Results: ✗ ✗ ✓]
  • 16.
  • 18. Filter? Add some filters (a/k/a facets)… [Results: ✗ ✗ ✓]
  • 19. Filter? Add some filters (a/k/a facets)… [Results: ✗ ✗ ✓]
  • 20. Filter? Add some filters (a/k/a facets)… [Facet panel: Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …] [Results: ✗ ✗ ✓]
  • 21. Filter? But what can we use as choices? [Facet panel: Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …] [Results: ✗ ✗ ✓]
  • 22. Entity Extraction (Name Tagging): Find the names of people, places, and organizations in a document.
  • 23. In-document Coreference Resolution: Group names that refer to the same person within a document.
  • 24. Filter choices? But what can we use as choices? [Facet panel: Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …] [Results: ✗ ✗ ✓]
  • 25. Filter choices? Choices: first way that each person was mentioned in each document? [Facet panel: Filter results by… Persons named: Kris Stephens, Chris Stephens, Dan Cathy, George Little, …] [Results: ✗ ✗ ✓]
  • 26. Filter? Choices: first name string for each person in each document? [Facet panel: Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, …] [Results: ✗ ✗ ✓]
  • 27. Filter? Choices: first name string for each person in each document? [Facet panel: Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, …] [Results: ✗ ✓]
  • 28. Filter? Problem: Ambiguity – one name, many entities. [Facet panel: Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, …] [Results: ✗ ✓]
  • 29. Filter? Problem: Variety – one person, many names. [Facet panel: Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, …] [Results: ✗ ✓]
  • 30. Filter? Problem: Variety – one person, many names. [Facet panel: Filtered by… Persons named: Chris Stephens. Add filters… Persons named: Dan Cathy, George Little, …, Chris Stevens, J. Christopher Stevens, …] [Results: ✗ ✓]
  • 31. Where does your favorite data set fall? [Chart axes: Variety, Ambiguity; labels: # of documents (Thousands, Millions, Billions)]
  • 32. Deal with ambiguity and variety? Magically group names by person across documents. [Facet panel: Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …] [Results: ✗ ✗ ✓]
  • 33. Labels for choices? But there’s still the problem of choices… [Facet panel: Filter results by… People: <choice 1>, <choice 2>, <choice 3>, …] [Results: ✗ ✗ ✓]
  • 34. Labels for choices? Use the person’s name from the highest-ranked document? Still some ambiguity. [Facet panel: Filter results by… People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …] [Results: ✗ ✗ ✓]
  • 35. Labels for choices? Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia). [Facet panel: Filter results by… People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …; database entries: J. Christopher Stevens, Chris Stephens, …] [Results: ✗ ✗ ✓]
  • 36. Labels for choices? For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page). [Facet panel: Filter results by… People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …] [Results: ✗ ✗ ✓]
  • 37. Filter? For items not in the database, infer a unique label (e.g., for a hypothetical Wikipedia page). [Facet panel: Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)] [Results: ✗ ✗ ✓]
  • 38. Filter. Let’s give it a try… [Facet panel: Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …] [Results: ✗ ✗ ✓]
  • 39. Filter. Let’s give it a try… [Facet panel: Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …] [Results: ✗ ✗ ✓]
  • 40. Filter. Let’s give it a try… [Facet panel: Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …] [Results: ✓]
  • 41. Filter. Let’s give it a try… [Facet panel: Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …] [Results: ✓]
  • 42. Filter. Let’s give it a try… [Facet panel: Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …] [Results: ✓ ✓ ✓]
  • 43. Does it work? How do you measure?
  • 44. How do you measure? Imagine this was the result of applying the filter with the name from Wikipedia. [Facet panel: Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …]
  • 45. How do you measure? Precision: for each document, how much of the stuff grouped with it is correct? [Facet panel: Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …] [Results: ✗ 1/3 = 33%; ✓ 2/3 = 67%; ✓ 2/3 = 67%]
  • 46. How do you measure? Recall: for each document, how much of the correct stuff is grouped with it? [Facet panel: Filtered by… People: J. Christopher Stevens. Add filters… People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …] [Results: ✓ 2/5 = 40%; ✓ 2/5 = 40%; ✗; ✗; ✗]
  • 47. Does it work? We often combine the Precision and Recall measurements into a single measurement, called “F”.
  • 48. Where does your favorite data set fall? [Chart axes: Variety, Ambiguity; labels: # of documents (Thousands, Millions, Billions)]
  • 49. Where does your favorite data lie? [Chart axes: Variety, Ambiguity; labels: # of documents (Thousands, Millions, Billions); F>=?, F>=70, F>=85. Corpora: ACE 2005, WEPS-2, TAC pre-2012, TAC eng 2012, TAC zho 2012, TAC spa 2012, Basis Balanced, Basis Ambig, Basis Variance 1, Basis Variance 2]
  • 50. Trading off Errors: Let’s pretend you’re researching the pastors instead. [Facet panel: Filter results by… People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …]
  • 51. Trading off Errors: What if you think there are too many (or too few)? Add a slider for making the filter finer (or coarser). [Facet panel: Filtered by… People: Kris Stephens (pastor). Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …]
  • 52. Trading off Errors: Make the filter finer. [Facet panel: Filtered by… People: Kris Stephens (pastor). Add filters… People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …]
  • 53. Demo
  • 54. Questions. Suggested questions: Doesn’t Google already do this? Speed? Scale? Multi-lingual? What other uses are there for entity resolution beyond faceted search?
  • 55. Thank you! For more information: visit www.basistech.com, write to conference@basistech.com, or call 617-386-2090.
  • 56. Doesn’t Google already do this? Some, when searching for famous entities.
  • 57. Speed/Scale: Support from BRAVE for scale in CY13! Research version: tested on up to 1 million documents; sub-second per document; incremental updates (i.e., you see documents published minutes ago).
  • 58. Doesn’t Google already do this?
  • 59. Other uses for entity resolution? Supporting relationship resolution by resolving the entities that participate in them; knowledge base population; integrating disparate data sets; alerting; improving relevance of search results; predictive analytics.
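
The per-document precision and recall described on slides 45 to 47 can be sketched in a few lines of Python. This is a minimal sketch, not the speaker's implementation: the document ids, the group labels, the function names `per_document_scores` and `f_measure`, and the choice of the balanced harmonic mean for "F" are all assumptions made for illustration (the slides do not say how precision and recall are weighted). The toy data is arranged to mirror the counts on the slides: three documents returned by the filter, two of them correct, out of five correct documents overall.

```python
def per_document_scores(predicted, truth):
    """Return {doc: (precision, recall)} for a grouping of documents.

    predicted and truth each map a document id to a group label.
    """
    scores = {}
    for doc in predicted:
        # Documents the system grouped with this one (including itself).
        grouped = {d for d, g in predicted.items() if g == predicted[doc]}
        # Documents that truly mention the same entity as this one.
        correct = {d for d, g in truth.items() if g == truth[doc]}
        overlap = grouped & correct
        scores[doc] = (len(overlap) / len(grouped), len(overlap) / len(correct))
    return scores


def f_measure(precision, recall):
    """Balanced harmonic mean of precision and recall (assumed weighting)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth: five documents about the ambassador, one about a pastor.
truth = {"d1": "ambassador", "d2": "ambassador", "d3": "ambassador",
         "d4": "ambassador", "d5": "ambassador", "d6": "pastor"}
# Hypothetical system output: the filter returns d1, d2 and (wrongly) d6;
# d3, d4 and d5 were each left in a group of their own.
predicted = {"d1": "filter", "d2": "filter", "d6": "filter",
             "d3": "other-3", "d4": "other-4", "d5": "other-5"}

scores = per_document_scores(predicted, truth)
for doc in ("d6", "d1", "d2"):
    p, r = scores[doc]
    print(f"{doc}: precision={p:.0%} recall={r:.0%} F={f_measure(p, r):.2f}")
```

For d1 and d2 this gives precision 2/3 and recall 2/5, the figures shown on slides 45 and 46, while the wrongly included d6 gets precision 1/3. Averaging the per-document values over a collection would give a single score per system; whether the talk averages this way is not stated in the slides.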