SlideShare uma empresa Scribd logo
1 de 61
Linking data without common identifiers


Javazone 2012, 2012-09-12
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga




1
Identity resolution




2
The problem

 • How to tell if two different records represent
   the same real-world entity?
DBPEDIA                                              MONDIAL

Id              http://dbpedia.org/resource/Samoa    Id             17019

Name            Samoa                                Name           Western Samoa

Founding date   1962-01-01                           Independence   01 01 1962

Capital         Apia                                 Capital        Apia, Samoa

Currency        Tala                                 Population     214384

Area            2831                                 Area           2860

Leader name     Tuilaepa Aiono Sailele Malielegaoi   GDP            415




     3
Linking across datasets



                              ?
              A                               B




• Independent applications, for example
    – even when unique identifiers exist, they tend to be
      unreliable, or missing


4
Within datasets




                              ?
                       A             B




• Duplicate records in the same application
    – in different tables (poor modelling)
    – in the same table (poor UI, poor logic, sloppy entry)

5
In data interchange



                          ?
               A                        B


              XML




• Receiving data from third parties
    – do we have this record already?



6
A difficult problem

• It requires O(n2) comparisons for n records
    – a million comparisons for 1000 records, 100 million
      for 10,000, ...
• Exact string comparison is not enough
    – must handle misspellings, name variants, etc
• Interpreting the data can be difficult even for a
  human being
    – is the address different because there are two
      different people, or because the person moved?
• ...
7
Record linkage

• Statisticians very often must connect data sets
  from different sources
• They call it “record linkage”
    – term coined in 19461)
    – mathematical foundations laid in 19592)
    – formalized in 1969 as “Fellegi-Sunter” model3)
• A whole subject area has been developed with
  well-known techniques, methods, and tools
    – these can of course be applied outside of statistics

     1) http://ajph.aphapublications.org/cgi/reprint/36/12/1412
8    2) http://www.sciencemag.org/content/130/3381/954.citation
     3) http://www.jstor.org/pss/2286061
The basic idea

     Field       Record 1    Record 2      Probability
     Name        acme inc    acme inc      0.9
     Assoc no    177477707                 0.5
     Zip code    9161        9161          0.6
     Country     norway      norway        0.51
     Address 1   mb 113      mailbox 113   0.49
     Address 2                             0.5
                                           0.931




9
Standard algorithm

• n2 comparisons for n records is unacceptable
     – must reduce number of direct comparisons
• Solution
     –   produce a key from field values,
     –   sort records by the key,
     –   for each record, compare with n nearest neighbours
     –   sometimes several different keys are produced, to increase
         chances of finding matches
• Downsides
     – requires coming up with a key
     – difficult to apply incrementally
     – sorting is expensive

10
String comparison

• Measure “distance” between strings
     – must handle spelling errors and name variants
     – must estimate probability that values belong to same entity
       despite differences
• Examples
     – John Smith ≈ Jonh Smith
     – J. Random Hacker ≈ James Random Hacker
• Many well-known algorithms have been developed
     – no one best choice, unfortunately
     – some are eager to merge, others less so
• Many are computationally expensive
     – O(n2) where n is length of string is common
     – this gets very costly when number of record pairs to compare is
       already high
11
Existing tools

• Commercial tools
     – big, sophisticated, and expensive
     – seem to follow the same principles
     – presumably also effective
• Open source tools
     –   generally made by and for statisticians
     –   nice user interfaces and rich configurability
     –   advanced maths
     –   architecture often not as flexible as it could be


12
Duke

DUplicate KillEr




13
Context

• Doing a project for Hafslund where we
  integrate data from many sources
• Entities are duplicated both inside systems and
  across systems

       Suppliers

      Customers      Customers       Customers

      Companies

         ERP           CRM            Billing


14
Requirements

• Must be flexible and configurable
     – no way to know in advance exactly what data we will need
       to deduplicate
• Must scale to large data sets
     – CRM alone has 1.4 million customer records
     – that‟s 2 trillion comparisons with naïve approach
• Must have an API
     – project uses SDshare protocol everywhere
     – must therefore be able to build connectors
• Must be able to work incrementally
     – process data as it changes and update conclusions on the
       fly

15
Reviewed existing tools...

• ...but didn‟t find anything suitable for our needs
• Therefore...




16
Duke

• Java deduplication engine
     – released as open source
     – http://code.google.com/p/duke/
• Does not use key approach
     – instead indexes data with Lucene
     – does Lucene searches to find potential matches
• Still a work in progress, but
     –   high performance,
     –   several data sources and comparators,
     –   being used for real in real projects,
     –   flexible architecture
17
How it works

         DataSource
     Extrac
              Map   Clean
       t


         DataSource
     Extrac
              Map   Clean
                            Index     Search   Compare   Link
       t


         DataSource
     Extrac
              Map   Clean
       t




                                Lucene index



18
Cleaning of data
            Côte D’Ivoire      cote d’ivoire

            Main St 113        113 main street

            Dec 25 1973        1973-12-25


• Used to normalize data from sources
     – necessary to get rid of differences that don‟t matter
• Makes comparing much more effective
     – pluggable in Duke
     – utilities and components already provided


19
Components

Data sources           Comparators
• CSV                  • ExactComparator
• JDBC                 • NumericComparator
• Sparql               • SoundexComparator
• NTriples             • TokenizedComparator
• <plug in your own>   • Levenshtein
                       • JaroWinkler
                       • Dice coefficient
                       • <plug in your own>

20
Features

• XML syntax for configuration
• Fairly complete command-line tool
     – with debugging support
• API for embedding
• Pluggable data sources, comparators, and
  cleaners
• Framework for composing cleaners
• Incremental processing
• High performance
21
Two modes




                     ?                    ?




     Deduplication       Record linkage

22
Why probabilities?

• Some tools use rules instead
     – if field1 and field2 match, then ...
     – if field1 and (field3 or field4) match, then ...
• This is easy to understand, but
     – there are many, many combinations
     – doesn‟t take inexact matching into account
     – adding another field becomes a real pain




23
Probabilities weigh all the evidence

     Field        Record 1      Record 2      Probability Accum.
     Name         acme inc      acme inc      0.9          0.9
     Assoc no     177477707                   0.5          0.9
     Zip code     9161          9161          0.6          0.931
     Country      norway        norway        0.51         0.934
     Address 1    mb 113        mailbox       0.3          0.857
                                113
     Address 2                                0.5          0.857



     Probabilities combined with Bayes Theorem

     Probabilities reduced if values don’t match exactly


24
Linking Mondial and DBpedia

A real-world example




25   http://code.google.com/p/duke/wiki/LinkingCountries
Finding properties to match

 • Need properties providing identity evidence
 • Matching on the properties in bold below
 • Extracted data to CSV for ease of use
DBPEDIA                                              MONDIAL

Id              http://dbpedia.org/resource/Samoa    Id             17019

Name            Samoa                                Name           Western Samoa

Founding date   1962-01-01                           Independence   01 01 1962

Capital         Apia                                 Capital        Apia, Samoa

Currency        Tala                                 Population     214384

Area            2831                                 Area           2860

Leader name     Tuilaepa Aiono Sailele Malielegaoi   GDP            415


     26
Configuration – data sources
<group>                                            <group>
 <csv>                                               <csv>
  <param name="input-file" value="dbpedia.csv"/>      <param name="input-file" value="mondial.csv"/>
  <param name="header-line" value="false"/>
                                                     <column name="id" property="ID"/>
  <column name="1" property="ID"/>                   <column name="country"
  <column name="2"                                       cleaner="no.priv...examples.CountryNameCleaner"
      cleaner="no.priv...CountryNameCleaner"             property="NAME"/>
      property="NAME"/>                              <column name="capital"
  <column name="3"                                       cleaner="no.priv...LowerCaseNormalizeCleaner"
      property="AREA"/>                                  property="CAPITAL"/>
  <column name="4"                                   <column name="area"
      cleaner="no.priv...CapitalCleaner"                 property="AREA"/>
      property="CAPITAL"/>                          </csv>
 </csv>                                            </group>
</group>




               Using groups tells Duke that we are linking across two data sets,
               not deduplicating by comparing all records against all others

27
Configuration – matching
<schema>
 <threshold>0.65</threshold>
                                                          Duke analyzes this setup and decides
 <property type="id">                                     only NAME and CAPITAL need to be
   <name>ID</name>                                        searched on in Lucene.
 </property>
 <property>
   <name>NAME</name>
   <comparator>no.priv.garshol.duke.Levenshtein</comparator>
   <low>0.3</low>
   <high>0.88</high>
 </property>
 <property>
   <name>AREA</name>
   <comparator>AreaComparator</comparator>
   <low>0.2</low>                                       <object class="no.priv.garshol.duke.NumericComparator"
   <high>0.6</high>                                          name="AreaComparator">
 </property>                                              <param name="min-ratio" value="0.7"/>
 <property>                                             </object>
   <name>CAPITAL</name>
   <comparator>no.priv.garshol.duke.Levenshtein</comparator>
   <low>0.4</low>
   <high>0.88</high>
 </property>
</schema>

28
Result

• Correct links found: 206 / 217 (94.9%)
• Wrong links found: 0 / 12 (0.0%)
• Unknown links found: 0
• Percent of links correct 100.0%, wrong 0.0%,
  unknown 0.0%
• Records with no link: 25
• Precision 100.0%, recall 94.9%, f-number
  0.974


29
Examples
Field         DBpedia         Mondial         Field         DBpedia         Mondial
Name          albania         albania         Name          kazakhstan      kazakstan
Area          28748           28750           Area          2724900         2717300
Capital       tirana          tirane          Capital       astana          almaty
Probability   0.980                           Probability   0.838

Field         DBpedia         Mondial         Field         DBpedia         Mondial
Name          côte d'ivoire   cote divoire    Name          grande comore   comoros
Area          322460          322460          Area          1148            2170
Capital       yamoussoukro    yamoussoukro    Capital       moroni          moroní
Probability   0.975                           Probability   0.440

Field         DBpedia         Mondial         Field         DBpedia         Mondial
Name          samoa           western samoa   Name          serbia          serbia and mont
Area          2831            2860            Area          102350          88361
Capital       apia            apia            Capital       sarajevo        sarajevo
Probability   0.824                           Probability   0.440

30
Western Samoa or American Samoa?

     Field         DBpedia   Mondial          Probability
     Name          samoa     western samoa    0.3
     Area          2831      2860             0.6
     Capital       apia      apia             0.88
     Probability                              0.824



     Field         DBpedia   Mondial          Probability
     Name          samoa     american samoa   0.3
     Area          2831      199              0.4
     Capital       apia      pago pago        0.4
     Probability                              0.067




31
An example of failure
                                          Field         DBpedia      Mondial
• Duke doesn‟t find this match            Name          kazakhstan   kazakstan
                                          Area          2724900      2717300
     – no tokens matching exactly         Capital       astana       almaty

     – Lucene search finds nothing        Probability   0.838


• Detailed comparison gives correct result
     – the problem is Lucene search
• Lucene does have Levenshtein search, but
     –   in Lucene 3.x it‟s very slow
     –   therefore not enabled now
     –   thinking of adding option to enable where needed
     –   Lucene 4.x will fix the performance problem
32
The effects of value matching

     Case           Precision   Recall   F
     Optimal        100%        94.9%    97.4%
     Cleaning,     99.3%        73.3%    84.4%
     exact compare
     No cleaning,   100%        93.4%    96.6%
     good compare
     No cleaning,  99.3%        70.9%    82.8%
     exact compare
     Cleaning,      97.6%       96.7%    97.2%
     JaroWinkler
     Cleaning,      95.9%       97.2%    96.6%
     Soundex




33
Usage at Hafslund

Duke in real life




34
The big picture
                                                                         DUPLICATES!
                         SDshare    Public     SDshare
                                     360




     Search        SDshare                                         SDshare
                                    Virtuoso
                                                                               ERP
     engine                          (RDF)


                                                                               CRM



                          SDshare                                             Billing
                                     Duke      SDshare
                                               contains owl:sameAs and
                                               haf:possiblySameAs
     Can override here              LinkDB

35
Experiences so far

• Incremental processing works fine
     – links added and retracted as data changes
• Performance not an issue at all
     – despite there being ~1.4 million records
     – requires a little bit of tuning, however
• Matching works well, but not perfect
     – data are very noisy and messy
     – biggest issue is clusters caused by generic values
     – also, matching is hard


36
Approximate string comparison

Overview of algorithms




37
Levenshtein

• Also known as edit distance
• Measures the number of edit operations
  necessary to turn s1 into s2
• Edit operations are
     – insert a character
     – remove a character
     – substitute a character
• Complexity
     – O(n2)     – naïve algorithm
     – O(n + d2) – where d = distance (with optimizations)

38
Levenshtein example

• Levenshtein -> Löwenstein
     – Levenstein (remove „h‟)
     – Lövenstein (substitute „ö‟)
     – Löwenstein (substitute „w‟)
• Edit distance = 3




39
Weighted Levenshtein

• Not all edit operations are equal
     – substituting “i” for “e” is a smaller edit than
       substituting “o” for “k”
• Weighted Levenshtein evaluates each edit
  operation as a number 0.0-1.0
• Weights depend on context
     – language, for example
• Slower, because some Levenshtein
  optimizations do not apply with weights

40
Jaro-Winkler

• Developed at the US Bureau of the Census
• For name comparisons
     – not well suited to long strings
     – best if given name/surname are separated
• Exists in a few variants
     – originally proposed by Winkler
     – then modified by Jaro
     – a few different versions of modifications etc



41
Jaro-Winkler definition

• Formula:
     – m = number of matching characters
     – t = number of transposed characters
• A character from string s1 matches s2 if the
  same character is found in s2 less then half the
  length of the string away
• Levenshtein ~ Löwenstein = 0.8
• Axel ~ Aksel = 0.783


42
Jaro-Winkler variant




43
Soundex

• A coarse schema for matching names by sound
     – produces a key from the name
     – names match if key is the same
• In common use in many places
     – Nav‟s person register uses it for search
     – built-in in many databases
     – ...




44
Soundex definition




45
Examples

•    soundex(“Axel”) = „A240‟
•    soundex(“Aksel”) = „A240‟
•    soundex(“Levenshtein”) = „L523‟
•    soundex(“Löwenstein”) = „L152‟




46
Metaphone

• Developed by Lawrence Philips
• Similar to Soundex, but much more complex
     – both more accurate and more sensitive
• Developed further into Double Metaphone
• Metaphone 3.0 also exists, but only available
  commercially




47
Metaphone examples

•    metaphone(“Axel”) = „AKSL‟
•    metaphone(“Aksel”) = „AKSL‟
•    metaphone(“Levenshtein”) = „LFNX‟
•    metaphone(“Löwenstein”) = „LWNS‟




48
Norphone

• I‟m working on a Norwegian alternative to
  Metaphone
     – designed to handle phonetic rules in Norweigan
       names
     – for example, “ch” = “k”
• Work in progress
     – http://code.google.com/p/duke/source/browse/src/
       main/java/no/priv/garshol/duke/comparators/Norp
       honeComparator.java



49
Dice coefficient

• A similarity measure for sets
     – set can be tokens in a string
     – or characters in a string
• Formula:




50
TFIDF

• Compares strings as sets of tokens
     – a la Dice coefficient
• However, takes frequency of tokens in corpus
  into account
     – this matches how we evaluate matches mentally
• Has done well in evaluations
     – however, can be difficult to evaluate
     – results will change as corpus changes



51
More comparators

• Smith-Waterman
     – originated in DNA sequencing
• Q-grams distance
     – breaks string into sets of pieces of q characters
     – then does set similarity comparison
• Monge-Elkan
     – similar to Smith-Waterman, but with affine gap distances
     – has done very well in evaluations
     – costly to evaluate
• Many, many more
     – ...

52
Performance




53
Why performance matters

• Performance really is a major issue
     – O(n2) behaviour
     – slow comparators
• An example
     – linking 840‟000 authors from Deichmann library
       with 200‟000 person records from DBpedia
     – with naïve in-memory backend this takes ~4 secs per
       record in second group
     – that‟s 9.5 days to do the whole job...



54
Bigger data




     So... 9.5 days for ~400,000 records

     Time to process increases with square of record count

     These guys can expect to spend 585 years...




55
Obvious solution #1

• Where is the program spending its time?
     – mostly in Levenshtein string comparisons
     – these actually take a while
• Solution
     – switch from n x n array to single n2 array (30% off)
     – break off comparison if we‟re going to go over
       difference limit, anyway (another 30% off)
• Overall speed improvement ~50%
     – so, from 9.5 days to 4.25 days
     – or, from 585 years to 293 years

56
Obvious solution #2

• Parallel computing
     – indexing takes just a few minutes (for small case)
     – record comparisons are independent of each other
     – so, spin up different computers and let them work in
       parallell
• Does it work?
     –   well, let‟s say we use 20 nodes
     –   enough to cost a bit, and give us some admin work
     –   not too many to cause communication issues
     –   expect speed-up by a factor of 10
• That gives us
     – from 4.5 days to 10 hours    Didn’t actually do this...
     – from 293 years to 29 years

57
Obvious solution #3

• Improve the algorithm!
     – that is, use the Lucene backend
     – now we only compare records matched by searches
     – dramatically cuts number of record pairs
• But does it work?
     – Deichmann case now runs in 2 hours (single machine)
     – performance still not linear
     – Mexican runtime therefore probably still too long
       (hence their email)


58
Final solution

• Searches can still return huge numbers of
  potential candidates
     – most relevant ones at the top
     – so, we can configure cutoffs (by relevance or number)
• Setting
     –   min-relevance: 0.99
     –   max-search-hits: 10
     –   runs Deichmann in 3 minutes (98.2% accuracy)
     –   performance now close to linear
     –   should work for the Mexicans, too

59
Next step

• We can do some more tweaking with Lucene
     – can use boosting to improve relevance
     – then can cut more records without losing recall
     – there are limits, however
• So next step may well be parallel computation
     – should be fun 




60
Comments/questions?

Slides will appear on http://slideshare.net/larsga




61

Mais conteúdo relacionado

Mais procurados

Query Processing with k-Anonymity
Query Processing with k-AnonymityQuery Processing with k-Anonymity
Query Processing with k-Anonymity
Waqas Tariq
 
The Network Data Structure in Computing
The Network Data Structure in ComputingThe Network Data Structure in Computing
The Network Data Structure in Computing
Marko Rodriguez
 

Mais procurados (20)

Vespa, A Tour
Vespa, A TourVespa, A Tour
Vespa, A Tour
 
The Future is Federated
The Future is FederatedThe Future is Federated
The Future is Federated
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent EngineLeveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Querying datasets on the Web with high availability
Querying datasets on the Web with high availabilityQuerying datasets on the Web with high availability
Querying datasets on the Web with high availability
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Linked Data Fragments
Linked Data FragmentsLinked Data Fragments
Linked Data Fragments
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online News
 
About elasticsearch
About elasticsearchAbout elasticsearch
About elasticsearch
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing database
 
How to Build a Semantic Search System
How to Build a Semantic Search SystemHow to Build a Semantic Search System
How to Build a Semantic Search System
 
Query Processing with k-Anonymity
Query Processing with k-AnonymityQuery Processing with k-Anonymity
Query Processing with k-Anonymity
 
ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for...
ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for...ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for...
ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for...
 
Thought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered SearchThought Vectors and Knowledge Graphs in AI-powered Search
Thought Vectors and Knowledge Graphs in AI-powered Search
 
The Network Data Structure in Computing
The Network Data Structure in ComputingThe Network Data Structure in Computing
The Network Data Structure in Computing
 
JPJ1422 Fast Nearest Neighbour Search With Keywords
JPJ1422   Fast Nearest Neighbour Search With KeywordsJPJ1422   Fast Nearest Neighbour Search With Keywords
JPJ1422 Fast Nearest Neighbour Search With Keywords
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
Caspar Preservation Methodology Steve Renkin
Caspar Preservation Methodology Steve RenkinCaspar Preservation Methodology Steve Renkin
Caspar Preservation Methodology Steve Renkin
 

Semelhante a Linking data without common identifiers

Hafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practiceHafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practice
Lars Marius Garshol
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
Don Demcsak
 

Semelhante a Linking data without common identifiers (20)

Hafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practiceHafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practice
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
The Entity Registry System @ Verisign Labs, 2013
The Entity Registry System @ Verisign Labs, 2013The Entity Registry System @ Verisign Labs, 2013
The Entity Registry System @ Verisign Labs, 2013
 
Evolution of the DBA to Data Platform Administrator/Specialist
Evolution of the DBA to Data Platform Administrator/SpecialistEvolution of the DBA to Data Platform Administrator/Specialist
Evolution of the DBA to Data Platform Administrator/Specialist
 
MySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesMySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion Queries
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Data Science
Data ScienceData Science
Data Science
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Grails goes Graph
Grails goes GraphGrails goes Graph
Grails goes Graph
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Databases for Data Science
Databases for Data ScienceDatabases for Data Science
Databases for Data Science
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 

Mais de Lars Marius Garshol

Mais de Lars Marius Garshol (20)

JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformation
 
Data collection in AWS at Schibsted
Data collection in AWS at SchibstedData collection in AWS at Schibsted
Data collection in AWS at Schibsted
 
Kveik - what is it?
Kveik - what is it?Kveik - what is it?
Kveik - what is it?
 
Nature-inspired algorithms
Nature-inspired algorithmsNature-inspired algorithms
Nature-inspired algorithms
 
Collecting 600M events/day
Collecting 600M events/dayCollecting 600M events/day
Collecting 600M events/day
 
History of writing
History of writingHistory of writing
History of writing
 
NoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativityNoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativity
 
Norwegian farmhouse ale
Norwegian farmhouse aleNorwegian farmhouse ale
Norwegian farmhouse ale
 
Archive integration with RDF
Archive integration with RDFArchive integration with RDF
Archive integration with RDF
 
The Euro crisis in 10 minutes
The Euro crisis in 10 minutesThe Euro crisis in 10 minutes
The Euro crisis in 10 minutes
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Linked Open Data for the Cultural Sector
Linked Open Data for the Cultural SectorLinked Open Data for the Cultural Sector
Linked Open Data for the Cultural Sector
 
NoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativityNoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativity
 
Bitcoin - digital gold
Bitcoin - digital goldBitcoin - digital gold
Bitcoin - digital gold
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Hops - the green gold
Hops - the green goldHops - the green gold
Hops - the green gold
 
Big data 101
Big data 101Big data 101
Big data 101
 
Approximate string comparators
Approximate string comparatorsApproximate string comparators
Approximate string comparators
 
Experiments in genetic programming
Experiments in genetic programmingExperiments in genetic programming
Experiments in genetic programming
 
Semantisk integrasjon
Semantisk integrasjonSemantisk integrasjon
Semantisk integrasjon
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Linking data without common identifiers

  • 1. Linking data without common identifiers Javazone 2012, 2012-09-12 Lars Marius Garshol, <larsga@bouvet.no> http://twitter.com/larsga 1
  • 3. The problem • How to tell if two different records represent the same real-world entity? DBPEDIA MONDIAL Id http://dbpedia.org/resource/Samoa Id 17019 Name Samoa Name Western Samoa Founding date 1962-01-01 Independence 01 01 1962 Capital Apia Capital Apia, Samoa Currency Tala Population 214384 Area 2831 Area 2860 Leader name Tuilaepa Aiono Sailele Malielegaoi GDP 415 3
  • 4. Linking across datasets ? A B • Independent applications, for example – even when unique identifiers exist, they tend to be unreliable, or missing 4
  • 5. Within datasets ? A B • Duplicate records in the same application – in different tables (poor modelling) – in the same table (poor UI, poor logic, sloppy entry) 5
  • 6. In data interchange ? A B XML • Receiving data from third parties – do we have this record already? 6
  • 7. A difficult problem • It requires O(n2) comparisons for n records – a million comparisons for 1000 records, 100 million for 10,000, ... • Exact string comparison is not enough – must handle misspellings, name variants, etc • Interpreting the data can be difficult even for a human being – is the address different because there are two different people, or because the person moved? • ... 7
  • 8. Record linkage • Statisticians very often must connect data sets from different sources • They call it “record linkage” – term coined in 19461) – mathematical foundations laid in 19592) – formalized in 1969 as “Fellegi-Sunter” model3) • A whole subject area has been developed with well-known techniques, methods, and tools – these can of course be applied outside of statistics 1) http://ajph.aphapublications.org/cgi/reprint/36/12/1412 8 2) http://www.sciencemag.org/content/130/3381/954.citation 3) http://www.jstor.org/pss/2286061
  • 9. The basic idea Field Record 1 Record 2 Probability Name acme inc acme inc 0.9 Assoc no 177477707 0.5 Zip code 9161 9161 0.6 Country norway norway 0.51 Address 1 mb 113 mailbox 113 0.49 Address 2 0.5 0.931 9
  • 10. Standard algorithm • n2 comparisons for n records is unacceptable – must reduce number of direct comparisons • Solution – produce a key from field values, – sort records by the key, – for each record, compare with n nearest neighbours – sometimes several different keys are produced, to increase chances of finding matches • Downsides – requires coming up with a key – difficult to apply incrementally – sorting is expensive 10
  • 11. String comparison • Measure “distance” between strings – must handle spelling errors and name variants – must estimate probability that values belong to same entity despite differences • Examples – John Smith ≈ Jonh Smith – J. Random Hacker ≈ James Random Hacker • Many well-known algorithms have been developed – no one best choice, unfortunately – some are eager to merge, others less so • Many are computationally expensive – O(n2) where n is length of string is common – this gets very costly when number of record pairs to compare is already high 11
  • 12. Existing tools • Commercial tools – big, sophisticated, and expensive – seem to follow the same principles – presumably also effective • Open source tools – generally made by and for statisticians – nice user interfaces and rich configurability – advanced maths – architecture often not as flexible as it could be 12
  • 14. Context • Doing a project for Hafslund where we integrate data from many sources • Entities are duplicated both inside systems and across systems Suppliers Customers Customers Customers Companies ERP CRM Billing 14
  • 15. Requirements • Must be flexible and configurable – no way to know in advance exactly what data we will need to deduplicate • Must scale to large data sets – CRM alone has 1.4 million customer records – that‟s 2 trillion comparisons with naïve approach • Must have an API – project uses SDshare protocol everywhere – must therefore be able to build connectors • Must be able to work incrementally – process data as it changes and update conclusions on the fly 15
  • 16. Reviewed existing tools... • ...but didn‟t find anything suitable for our needs • Therefore... 16
  • 17. Duke • Java deduplication engine – released as open source – http://code.google.com/p/duke/ • Does not use key approach – instead indexes data with Lucene – does Lucene searches to find potential matches • Still a work in progress, but – high performance, – several data sources and comparators, – being used for real in real projects, – flexible architecture 17
  • 18. How it works DataSource Extrac Map Clean t DataSource Extrac Map Clean Index Search Compare Link t DataSource Extrac Map Clean t Lucene index 18
  • 19. Cleaning of data Côte D’Ivoire cote d’ivoire Main St 113 113 main street Dec 25 1973 1973-12-25 • Used to normalize data from sources – necessary to get rid of differences that don‟t matter • Makes comparing much more effective – pluggable in Duke – utilities and components already provided 19
  • 20. Components Data sources Comparators • CSV • ExactComparator • JDBC • NumericComparator • Sparql • SoundexComparator • NTriples • TokenizedComparator • <plug in your own> • Levenshtein • JaroWinkler • Dice coefficient • <plug in your own> 20
  • 21. Features • XML syntax for configuration • Fairly complete command-line tool – with debugging support • API for embedding • Pluggable data sources, comparators, and cleaners • Framework for composing cleaners • Incremental processing • High performance 21
  • 22. Two modes ? ? Deduplication Record linkage 22
  • 23. Why probabilities? • Some tools use rules instead – if field1 and field2 match, then ... – if field1 and (field3 or field4) match, then ... • This is easy to understand, but – there are many, many combinations – doesn‟t take inexact matching into account – adding another field becomes a real pain 23
  • 24. Probabilities weigh all the evidence Field Record 1 Record 2 Probability Accum. Name acme inc acme inc 0.9 0.9 Assoc no 177477707 0.5 0.9 Zip code 9161 9161 0.6 0.931 Country norway norway 0.51 0.934 Address 1 mb 113 mailbox 0.3 0.857 113 Address 2 0.5 0.857 Probabilities combined with Bayes Theorem Probabilities reduced if values don’t match exactly 24
  • 25. Linking Mondial and DBpedia A real-world example 25 http://code.google.com/p/duke/wiki/LinkingCountries
  • 26. Finding properties to match • Need properties providing identity evidence • Matching on the properties in bold below • Extracted data to CSV for ease of use DBPEDIA MONDIAL Id http://dbpedia.org/resource/Samoa Id 17019 Name Samoa Name Western Samoa Founding date 1962-01-01 Independence 01 01 1962 Capital Apia Capital Apia, Samoa Currency Tala Population 214384 Area 2831 Area 2860 Leader name Tuilaepa Aiono Sailele Malielegaoi GDP 415 26
  • 27. Configuration – data sources <group> <group> <csv> <csv> <param name="input-file" value="dbpedia.csv"/> <param name="input-file" value="mondial.csv"/> <param name="header-line" value="false"/> <column name="id" property="ID"/> <column name="1" property="ID"/> <column name="country" <column name="2" cleaner="no.priv...examples.CountryNameCleaner" cleaner="no.priv...CountryNameCleaner" property="NAME"/> property="NAME"/> <column name="capital" <column name="3" cleaner="no.priv...LowerCaseNormalizeCleaner" property="AREA"/> property="CAPITAL"/> <column name="4" <column name="area" cleaner="no.priv...CapitalCleaner" property="AREA"/> property="CAPITAL"/> </csv> </csv> </group> </group> Using groups tells Duke that we are linking across two data sets, not deduplicating by comparing all records against all others 27
  • 28. Configuration – matching <schema> <threshold>0.65</threshold> Duke analyzes this setup and decides <property type="id"> only NAME and CAPITAL need to be <name>ID</name> searched on in Lucene. </property> <property> <name>NAME</name> <comparator>no.priv.garshol.duke.Levenshtein</comparator> <low>0.3</low> <high>0.88</high> </property> <property> <name>AREA</name> <comparator>AreaComparator</comparator> <low>0.2</low> <object class="no.priv.garshol.duke.NumericComparator" <high>0.6</high> name="AreaComparator"> </property> <param name="min-ratio" value="0.7"/> <property> </object> <name>CAPITAL</name> <comparator>no.priv.garshol.duke.Levenshtein</comparator> <low>0.4</low> <high>0.88</high> </property> </schema> 28
  • 29. Result • Correct links found: 206 / 217 (94.9%) • Wrong links found: 0 / 12 (0.0%) • Unknown links found: 0 • Percent of links correct 100.0%, wrong 0.0%, unknown 0.0% • Records with no link: 25 • Precision 100.0%, recall 94.9%, f-number 0.974 29
  • 30. Examples Field DBpedia Mondial Field DBpedia Mondial Name albania albania Name kazakhstan kazakstan Area 28748 28750 Area 2724900 2717300 Capital tirana tirane Capital astana almaty Probability 0.980 Probability 0.838 Field DBpedia Mondial Field DBpedia Mondial Name côte d'ivoire cote divoire Name grande comore comoros Area 322460 322460 Area 1148 2170 Capital yamoussoukro yamoussoukro Capital moroni moroní Probability 0.975 Probability 0.440 Field DBpedia Mondial Field DBpedia Mondial Name samoa western samoa Name serbia serbia and mont Area 2831 2860 Area 102350 88361 Capital apia apia Capital sarajevo sarajevo Probability 0.824 Probability 0.440 30
  • 31. Western Samoa or American Samoa? Field DBpedia Mondial Probability Name samoa western samoa 0.3 Area 2831 2860 0.6 Capital apia apia 0.88 Probability 0.824 Field DBpedia Mondial Probability Name samoa american samoa 0.3 Area 2831 199 0.4 Capital apia pago pago 0.4 Probability 0.067 31
  • 32. An example of failure Field DBpedia Mondial • Duke doesn‟t find this match Name kazakhstan kazakstan Area 2724900 2717300 – no tokens matching exactly Capital astana almaty – Lucene search finds nothing Probability 0.838 • Detailed comparison gives correct result – the problem is Lucene search • Lucene does have Levenshtein search, but – in Lucene 3.x it‟s very slow – therefore not enabled now – thinking of adding option to enable where needed – Lucene 4.x will fix the performance problem 32
  • 33. The effects of value matching Case Precision Recall F Optimal 100% 94.9% 97.4% Cleaning, 99.3% 73.3% 84.4% exact compare No cleaning, 100% 93.4% 96.6% good compare No cleaning, 99.3% 70.9% 82.8% exact compare Cleaning, 97.6% 96.7% 97.2% JaroWinkler Cleaning, 95.9% 97.2% 96.6% Soundex 33
  • 34. Usage at Hafslund Duke in real life 34
  • 35. The big picture DUPLICATES! SDshare Public SDshare 360 Search SDshare SDshare Virtuoso ERP engine (RDF) CRM SDshare Billing Duke SDshare contains owl:sameAs and haf:possiblySameAs Can override here LinkDB 35
  • 36. Experiences so far • Incremental processing works fine – links added and retracted as data changes • Performance not an issue at all – despite there being ~1.4 million records – requires a little bit of tuning, however • Matching works well, but not perfect – data are very noisy and messy – biggest issue is clusters caused by generic values – also, matching is hard 36
  • 38. Levenshtein • Also known as edit distance • Measures the number of edit operations necessary to turn s1 into s2 • Edit operations are – insert a character – remove a character – substitute a character • Complexity – O(n2) – naïve algorithm – O(n + d2) – where d = distance (with optimizations) 38
  • 39. Levenshtein example • Levenshtein -> Löwenstein – Levenstein (remove „h‟) – Lövenstein (substitute „ö‟) – Löwenstein (substitute „w‟) • Edit distance = 3 39
  • 40. Weighted Levenshtein • Not all edit operations are equal – substituting “i” for “e” is a smaller edit than substituting “o” for “k” • Weighted Levenshtein evaluates each edit operation as a number 0.0-1.0 • Weights depend on context – language, for example • Slower, because some Levenshtein optimizations do not apply with weights 40
  • 41. Jaro-Winkler • Developed at the US Bureau of the Census • For name comparisons – not well suited to long strings – best if given name/surname are separated • Exists in a few variants – originally proposed by Winkler – then modified by Jaro – a few different versions of modifications etc 41
  • 42. Jaro-Winkler definition • Formula: – m = number of matching characters – t = number of transposed characters • A character from string s1 matches s2 if the same character is found in s2 less then half the length of the string away • Levenshtein ~ Löwenstein = 0.8 • Axel ~ Aksel = 0.783 42
  • 44. Soundex • A coarse schema for matching names by sound – produces a key from the name – names match if key is the same • In common use in many places – Nav‟s person register uses it for search – built-in in many databases – ... 44
  • 46. Examples • soundex(“Axel”) = „A240‟ • soundex(“Aksel”) = „A240‟ • soundex(“Levenshtein”) = „L523‟ • soundex(“Löwenstein”) = „L152‟ 46
  • 47. Metaphone • Developed by Lawrence Philips • Similar to Soundex, but much more complex – both more accurate and more sensitive • Developed further into Double Metaphone • Metaphone 3.0 also exists, but only available commercially 47
  • 48. Metaphone examples • metaphone(“Axel”) = „AKSL‟ • metaphone(“Aksel”) = „AKSL‟ • metaphone(“Levenshtein”) = „LFNX‟ • metaphone(“Löwenstein”) = „LWNS‟ 48
  • 49. Norphone • I‟m working on a Norwegian alternative to Metaphone – designed to handle phonetic rules in Norweigan names – for example, “ch” = “k” • Work in progress – http://code.google.com/p/duke/source/browse/src/ main/java/no/priv/garshol/duke/comparators/Norp honeComparator.java 49
  • 50. Dice coefficient • A similarity measure for sets – set can be tokens in a string – or characters in a string • Formula: 50
  • 51. TFIDF • Compares strings as sets of tokens – a la Dice coefficient • However, takes frequency of tokens in corpus into account – this matches how we evaluate matches mentally • Has done well in evaluations – however, can be difficult to evaluate – results will change as corpus changes 51
  • 52. More comparators • Smith-Waterman – originated in DNA sequencing • Q-grams distance – breaks string into sets of pieces of q characters – then does set similarity comparison • Monge-Elkan – similar to Smith-Waterman, but with affine gap distances – has done very well in evaluations – costly to evaluate • Many, many more – ... 52
  • 54. Why performance matters • Performance really is a major issue – O(n2) behaviour – slow comparators • An example – linking 840‟000 authors from Deichmann library with 200‟000 person records from DBpedia – with naïve in-memory backend this takes ~4 secs per record in second group – that‟s 9.5 days to do the whole job... 54
  • 55. Bigger data So... 9.5 days for ~400,000 records Time to process increases with square of record count These guys can expect to spend 585 years... 55
  • 56. Obvious solution #1 • Where is the program spending its time? – mostly in Levenshtein string comparisons – these actually take a while • Solution – switch from n x n array to single n2 array (30% off) – break off comparison if we‟re going to go over difference limit, anyway (another 30% off) • Overall speed improvement ~50% – so, from 9.5 days to 4.25 days – or, from 585 years to 293 years 56
  • 57. Obvious solution #2 • Parallel computing – indexing takes just a few minutes (for small case) – record comparisons are independent of each other – so, spin up different computers and let them work in parallell • Does it work? – well, let‟s say we use 20 nodes – enough to cost a bit, and give us some admin work – not too many to cause communication issues – expect speed-up by a factor of 10 • That gives us – from 4.5 days to 10 hours Didn’t actually do this... – from 293 years to 29 years 57
  • 58. Obvious solution #3 • Improve the algorithm! – that is, use the Lucene backend – now we only compare records matched by searches – dramatically cuts number of record pairs • But does it work? – Deichmann case now runs in 2 hours (single machine) – performance still not linear – Mexican runtime therefore probably still too long (hence their email) 58
  • 59. Final solution • Searches can still return huge numbers of potential candidates – most relevant ones at the top – so, we can configure cutoffs (by relevance or number) • Setting – min-relevance: 0.99 – max-search-hits: 10 – runs Deichmann in 3 minutes (98.2% accuracy) – performance now close to linear – should work for the Mexicans, too 59
  • 60. Next step • We can do some more tweaking with Lucene – can use boosting to improve relevance – then can cut more records without losing recall – there are limits, however • So next step may well be parallel computation – should be fun  60
  • 61. Comments/questions? Slides will appear on http://slideshare.net/larsga 61