10% Wrong, 90% Done: A practical approach to bibliographic de-duplication. Rogan Hamby, South Carolina State Library, rhamby@statelibrary.sc.gov. Shasta Brewer, York County Library, Shasta.brewer@yclibrary.net.
Made Up Words When I say ‘deduping’ I mean ‘MARC record de-duplication’.
The Melting Pot We were ten library systems with no standard source of MARC records. We came from five ILSes. Each had its own needs and workflow. The MARC records reflected that.
Over 2,000,000 Records Ten library systems joined in three waves.
Early Effort During each wave we ran a deduping script. The script functioned as designed; however, its matches were too few for our needs.
100% Accurate It had a very high standard for creating matches. No bad merges were created.
Service Issue When a patron searched the catalog it was messy.
This caused problems with searching and placing holds.
It’s All About the TCNs Why was this happening? Because identical items were divided among multiple similar bib records with distinct fingerprints, since the records came from multiple sources.
Time for the Cleaning Gloves In March 2009 we began discussing the issue with ESI. The low merging rate was due to the very precise and conservative fingerprinting of the deduping process. In true open source spirit we decided to roll our own solution and start cleaning up the database.
Fingerprinting Fingerprinting is identifying a unique MARC record by its properties.
Because fingerprinting identifies unique records, it was of limited use for us: our records came from many sources, so copies of the same title often carried different fingerprints.
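For illustration only (this is not the actual fingerprinting algorithm used by Evergreen or ESI), here is a minimal sketch of why strict fingerprinting misses near-duplicates: hashing several fields together means any variation in any field keeps otherwise identical records apart. The fields and sample data are hypothetical.

```python
# Illustrative strict fingerprint: hash several normalized fields together.
# Any difference in any field yields a different fingerprint, so the two
# records below never group, even though they describe the same book.
import hashlib

def strict_fingerprint(title, author, publisher, pub_year):
    key = "|".join(s.strip().lower() for s in (title, author, publisher, pub_year))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

print(strict_fingerprint("The Hobbit", "Tolkien, J. R. R.", "Houghton Mifflin", "1999"))
print(strict_fingerprint("The Hobbit", "Tolkien, J.R.R.", "Houghton Mifflin Co.", "1999"))
```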
A Disclaimer The initial deduping, as designed, was very accurate.  It emphasized avoiding imprecise matches.  We decided that we had different priorities and were willing to make compromises.
MARC Crimes Unit We decided to go past fingerprinting and build profiles based on broad MARC attributes.
Project Goals Improve searching. Faster holds filling.
The Team Shasta Brewer – York County Lynn Floyd – Anderson County Rogan Hamby – Florence County / State Library
The Mess 2,048,936 bib records
On Changes During the development process a lot changed from early discussion to implementation. We weighed decisions heavily on the side of needing to have a significant and practical impact on the catalog. “I watch the ripples change their size / But never leave the stream.”
Tilting at Windmills We refused to believe that the highest priority for deduping should be avoiding bad matches. The highest priority is creating the maximum positive impact on the catalog. Many said we were a bit mad. Fortunately, we took it as a compliment.
We ran extensive reports to model the bib data. A risky and unconventional model was proposed. Although we kept trying other models, the large number of matches the risky model produced made it too compelling to discard.
Why not just title and ISBN? We did socialize this idea.  And everyone did think we were nuts.
Method to the Madness Title and ISBN are the most commonly populated fields for identifying unique items. Records with ISBNs and titles accounted for over 60% of the bib records in the system. The remainder included SUDOCs, ISSNs, pre-ISBN items, and some that were just plain garbage.
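As a rough illustration of the idea (the production code was SQL developed with Equinox, and the normalization described later in this deck is more thorough), a title-plus-ISBN match key could look like the sketch below; the cleanup rules and sample records are hypothetical.

```python
# Hypothetical title + ISBN match key: fold case, drop punctuation, and keep
# only the ISBN digits, so superficial cataloging differences don't block a match.

def match_key(title, isbn):
    cleaned = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    cleaned = " ".join(cleaned.split())                      # collapse whitespace
    digits = "".join(ch for ch in isbn if ch.isdigit() or ch in "xX")
    return (cleaned, digits)

# Two entries a strict fingerprint would keep apart now share one key.
print(match_key("The Hobbit.", "0-618-00221-9"))
print(match_key("The Hobbit", "ISBN 0618002219 (pbk.)"))
```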
Geronimo We decided to do it!
What Was Left Behind Records without a valid ISBN. Records without any ISBN (serials, etc.). Pre-cat and stub records. Pure junk records. And other things that would require such extraordinarily convoluted matching that the risk exceeded even our pain threshold for a first run.
Based on modeling, we estimated a conservative ~300,000 merges, or about 25% of our ISBNs.
The Wisdom of Crowds Conventional wisdom said that MARC could not be generalized because of unique information in the records. We were taking risks and were very aware of it, but the need to create a large impact on our database drove us to disregard the friendly warnings.
An Imperfect World We knew that we would miss things that could potentially be merged. We knew that we would create some bad merges.  10% wrong to get it 90% done.
Next Step … Normalization With matching decided, we needed to normalize the data. This was done to copies of the production MARC records, and those copies were used to make the match lists. Normalization is needed because of variability in how data was entered. It allows us to get the most possible matches from the data.
Normalization Details We normalized case, punctuation, numbers, non-Roman characters, trailing and leading spaces, some GMDs entered as parts of titles, redacted fields, 10-digit ISBNs as 13-digit, and lots, lots more. This was not done to the permanent records but to copies used to make the lists.
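A minimal sketch of two of the normalizations named above, title cleanup (case, punctuation, non-Roman characters, a bracketed GMD) and re-expressing a 10-digit ISBN as 13 digits. The production rules lived in SQL and covered much more, so treat this as illustrative.

```python
# Illustrative normalization helpers applied to copies of the records, never
# to the permanent MARC records themselves.
import re
import unicodedata

def normalize_title(title):
    """Fold case, strip accents and punctuation, drop a bracketed GMD, squeeze spaces."""
    title = re.sub(r"\[[^\]]*\]", " ", title)                 # e.g. "[sound recording]"
    title = unicodedata.normalize("NFKD", title)
    title = "".join(c for c in title if not unicodedata.combining(c))
    title = re.sub(r"[^0-9a-z ]", " ", title.lower())
    return " ".join(title.split())

def isbn10_to_13(isbn10):
    """Re-express a 10-digit ISBN as 13 digits so both forms can match."""
    core = "978" + re.sub(r"[^0-9Xx]", "", isbn10)[:9]
    check = (10 - sum(int(d) * (1 if i % 2 == 0 else 3)
                      for i, d in enumerate(core)) % 10) % 10
    return core + str(check)

print(normalize_title("The Hobbit [sound recording] / J.R.R. Tolkien."))
print(isbn10_to_13("0-618-00221-9"))                          # -> 9780618002214
```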
Weighting Finally, we had to weight the records that had been matched to determine which should be the record to keep. To do this, each bib record was given a score to profile its quality.
The Weighting Criteria We looked at the presence, length, and number of entries in the 003, 02X, 24X, 300, 260$b, 100, 010, 500s, 440, 490, 830s, 7XX, 9XX, and 59X fields to manipulate, add to, subtract from, bludgeon, poke, and eventually determine a 24-digit number that would profile the quality of a bib record.
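A hedged sketch of the scoring idea: build one fixed-width, sortable number by concatenating zero-padded presence flags and field counts. The particular fields, widths, and order below are illustrative, not the 24-digit production recipe.

```python
# Illustrative quality score: presence flags and counts for a few fields,
# zero-padded and concatenated so a simple string comparison ranks records.

def quality_score(record):
    """record: dict mapping a MARC tag to a list of field values."""
    def count(tag, width=2):
        return str(min(len(record.get(tag, [])), 10 ** width - 1)).zfill(width)

    def present(tag):
        return "1" if record.get(tag) else "0"

    return (
        present("003")                                    # control number identifier present?
        + present("100")                                  # main entry (author) present?
        + count("020")                                    # how many ISBNs
        + count("500")                                    # how many general notes
        + str(min(len("".join(record.get("300", []))), 99)).zfill(2)  # length of physical description
    )

rich = {"003": ["OCoLC"], "100": ["Tolkien, J. R. R."], "020": ["9780618002214"],
        "500": ["Map on lining papers."], "300": ["365 p. : ill., maps ; 21 cm."]}
stub = {"020": ["9780618002214"]}
print(quality_score(rich), quality_score(stub))           # the richer record sorts higher
```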
The Merging Once the weighting is done, the highest-scored record in each group is made the master record; the copies and holds from the others are moved to it, and those bibs are marked deleted.
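Continuing the same illustrative assumptions (hypothetical bib IDs, match keys, and scores), the merge step amounts to grouping candidate bibs by match key, keeping the highest-scoring bib as the master, and noting which bibs would give up their copies and holds before being marked deleted.

```python
# Illustrative merge plan: group by match key, keep the best-scored bib as
# the master, and list the bibs whose copies/holds would move to it.
from collections import defaultdict

def plan_merges(bibs):
    """bibs: list of (bib_id, match_key, quality_score) tuples."""
    groups = defaultdict(list)
    for bib_id, key, score in bibs:
        groups[key].append((score, bib_id))

    plan = []
    for members in groups.values():
        if len(members) < 2:
            continue                         # nothing to merge
        members.sort(reverse=True)           # highest score first
        master = members[0][1]
        losers = [bib_id for _, bib_id in members[1:]]
        plan.append((master, losers))
    return plan

bibs = [(101, ("the hobbit", "9780618002214"), "11010128"),
        (102, ("the hobbit", "9780618002214"), "00010000"),
        (103, ("dune",       "9780441013593"), "11010212")]
print(plan_merges(bibs))                     # [(101, [102])]
```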
Checking the Weight  We did a report of items that would group based on our criteria and had staff do sample manual checks to see if they could live with the dominant record.  We collectively checked ~1,000 merges.
90% of the time we felt the highest-quality record was selected as the dominant record. More than 9% of the time an acceptable record was selected. In a very few instances human errors in the record made the system create a bad profile, but never an actual bad dominant record.
The Coding We proceeded to contract with Equinox to have them develop the code and run it against our test environment (and eventually production).  Galen Charlton was our primary contact in this. In addition to his coding of the algorithm he also provided input about additional criteria to include in the weighting and normalization.
Test Server Once the process was run on the test server, we took the new batches of records and broke them into 50,000-record chunks. We then gave those chunks to member libraries and had them do random samples for five days.
Fixed As We Went Non-standard cataloging (ongoing). 13-digit ISBNs normalizing as 10-digit ISBNs. Identified many parts of item sets as issues. Shared-title publications with different formats. The order of the ISBNs. Kits.
In Conclusion We don’t know how many bad matches were formed; the total discovered after a year is fewer than 200. We were able to purge 326,098 bib records, or about 27% of our ISBN-based collection.
Evaluation The catalog is visibly cleaner. The cost per bib record was 1.5 cents. Absolutely successful!
Future We want to continue to refine it (e.g., 020 subfield z). There are still problems that need to be cleaned up in the catalog, some manually and some by automation. Raising standards.
New libraries that have joined SCLENDs use our deduping algorithm, not the old one. It has continued to be successful.
Open Sourcing the Solution We are releasing the algorithm under the Creative Commons Attribution Non-Commercial license. We are releasing the SQL code under the GPL.
Questions?
More related content

Similar to 10% Wrong, 90% Done (20)

Code4Lib Keynote 2011
Teaching wild horses to sing: Harmonizing the deluge of electronic serials
Carpenter, McCraken, Ventimiglia, Noonan, and Walker "KBART and the OpenURL: ...
Making the most of OCLC's Reclamation Batchload
RDA - an updated overview
2013 11-06 lsr-dublin_m_hausenblas_solr as recommendation engine
Statisics for hackers
The impact of domain-specific stop-word lists on ecommerce website search per...
Libraries & Tech for Good, 11 July 2016 (with notes)
BigData and Algorithms - LA Algorithmic Trading
Web technology: Web search
What’s wrong with research papers - and (how) can we fix it?
RDA Intro - AACR2 / MARC> RDA / FRBR / Semantic Web
Georgetown Data Science - Team BuzzFeed
Closing the Findability Gap: 8 better practices from information architecture
SharePoint Metadata - Simple to Sublime
BUS216 Exam #3 Review – SP14 1 1. In order to ha.docx
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Nuts and bolts
Essay For Books
