TAR Versus Keyword Challenge
Private and Confidential – Copyright 2019
Prevalence (Richness)
• Percentage of population that is relevant.
• Important when choosing methodology.
Recall
• Percentage of relevant docs found.
• Defensibility.
Precision
• Percentage of retrieved docs that are relevant.
• Related to cost (review effort).
• 1/P = average docs reviewed per relevant doc found.
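As a rough illustration of the three definitions above, the sketch below computes prevalence, recall, precision, and 1/P from raw counts. The function name and all of the counts are made up for illustration; they are not from the challenge data.

```python
def retrieval_metrics(population, relevant_total, retrieved, relevant_retrieved):
    """Return prevalence, recall, precision, and docs reviewed per relevant doc found."""
    prevalence = relevant_total / population       # share of the population that is relevant
    recall = relevant_retrieved / relevant_total   # share of relevant docs that were found
    precision = relevant_retrieved / retrieved     # share of retrieved docs that are relevant
    return {
        "prevalence": prevalence,
        "recall": recall,
        "precision": precision,
        "docs_per_relevant": 1 / precision,        # 1/P
    }


# Hypothetical numbers: 100,000 docs, 5,000 of them relevant;
# the search retrieves 7,500 docs, 3,750 of which are relevant.
m = retrieval_metrics(100_000, 5_000, 7_500, 3_750)
print(m)
```

With these invented counts, precision is 50%, so on average two docs are reviewed for each relevant doc found.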
Precision-Recall Curve
Meaningful Comparison of Systems
Fair Approaches:
• Equal defensibility (recall), compare cost.
• Equal cost, compare defensibility (recall).
At least one system should achieve reasonable recall.
Bad Metric: Accuracy
• Percentage of predictions that are right.
• Makes bad systems look good.
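One way to see why accuracy flatters bad systems is to look at a low-prevalence population. The sketch below uses an invented population of 10,000 docs at 1% prevalence and a "system" that predicts every doc is non-relevant; the counts are assumptions chosen only to make the point.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions (relevant or not) that are right."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of relevant docs that were found."""
    return tp / (tp + fn)


# Hypothetical population: 10,000 docs, 100 relevant (1% prevalence).
# The system labels everything non-relevant, so it finds nothing.
tp, fp, tn, fn = 0, 0, 9_900, 100
lazy_accuracy = accuracy(tp, fp, tn, fn)
lazy_recall = recall(tp, fn)
print(lazy_accuracy, lazy_recall)
```

The do-nothing system is right about 99% of the docs while achieving zero recall, which is why accuracy says little about defensibility.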
Bad Metric: F1 Score
• F1 = 2*P*R / (P + R)
• Between P and R, closer to the smaller one.
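The formula above can be tried on a few precision/recall pairs. The pairs below are arbitrary values picked for illustration.

```python
def f1(p, r):
    """F1 score: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Arbitrary (precision, recall) pairs for illustration.
pairs = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
scores = [f1(p, r) for p, r in pairs]
for (p, r), s in zip(pairs, scores):
    print(f"P={p} R={r} F1={s:.2f}")
```

The first and last pairs get the same F1 (0.18, much closer to 0.1 than to 0.9) even though one has 10% recall and the other 90%, so F1 alone cannot tell a defensible result from an indefensible one.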
Are Research Results Relevant?
• Many studies aren’t focused on e-discovery.
• Appropriate metrics used?
• Reasonable recall achieved?
• Realistic data set?
Keyword Search vs. TAR
Search rules for the challenge (due to software limitations):
• No phrase search, proximity search, wildcards, or stemming
• Keywords are not case-sensitive
• Boolean operators must be upper case
• Weights (positive and negative) are OK. Default weight is 1.
• Example: microsoft^2.5 OR (windows AND NOT house)^1.2 OR software
Topics:
• Law: existing law, excluding politics or proposed new law
• Medical: business-oriented (not scientific) articles about the medical industry
• Biology: mainstream science articles (not medical treatment)
Submission: http://clustify.com/query
Analysis of final results will be posted at: http://blog.clustify.com
Keyword Search on Steroids
scientists^1000 OR gene^990 OR genes^973 OR protein^856 OR proteins^804 OR biotechnology^774 OR
cells^712 OR biology^708 OR dna^669 OR function^660 OR researchers^603 OR cell^600 OR human^543
OR expression^516 OR molecular^496 OR experiments^482 OR genetic^464 OR drugs^460 OR biotech^448
OR population^441 OR mammalian^427 OR development^411 OR sequence^404 OR investigators^374 OR
novel^368 OR disease^361 OR wild^361 OR pharmaceutical^360 OR reagents^349 OR adult^349 OR
scientific^345 OR island^344 OR antibody^335 OR rapid^328 OR synthesis^324 OR mouse^323 OR
…
OR reactions^-252 OR war^-252 OR populations^-264 OR computer^-283 OR effects^-289 OR optical^-291
OR electronic^-293 OR treatment^-296 OR risk^-310 OR society^-323 OR table^-344 OR learning^-346 OR
tests^-412
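A minimal sketch of how a flat weighted query like the one above could be used to sort documents. The slides do not describe how the challenge software computes scores, so summing the weights of the query words present in a doc is an assumption, as are the function names and the toy documents. The weights themselves are taken from the query above.

```python
def parse_weighted_query(query):
    """Parse a flat 'word^weight OR word^weight ...' query into {word: weight}."""
    weights = {}
    for term in query.split(" OR "):
        word, _, weight = term.strip().partition("^")
        # A term without an explicit weight gets the default weight of 1.
        weights[word.lower()] = float(weight) if weight else 1.0
    return weights


def score(doc_text, weights):
    """Assumed scoring rule: sum the weights of query words that occur in the doc."""
    words = set(doc_text.lower().split())
    return sum(w for word, w in weights.items() if word in words)


query = "gene^990 OR protein^856 OR cell^600 OR computer^-283 OR tests^-412"
weights = parse_weighted_query(query)

docs = {  # hypothetical toy documents
    "a": "the gene encodes a protein found in every cell",
    "b": "the computer ran the tests on a cell phone",
    "c": "quarterly earnings report",
}
ranked = sorted(docs, key=lambda d: score(docs[d], weights), reverse=True)
print([(d, score(docs[d], weights)) for d in ranked])
```

Under this assumed scoring rule, doc "b" still matches "cell", but the negative weights push it below even the doc that matches nothing. It is moved down the sorted list rather than excluded outright.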
Finding Word Weights
Training / Control Set Animation
Keyword Search Strategies
Similar to TAR 1.0 (SPL):
• Review a random sample of docs.
• Examine docs to find query keywords.
• Repeat until query improvement is minimal.
• Hard when prevalence is low.
Similar to TAR 2.0 (CAL):
• Create a query.
• Review top docs from query.
• Adjust query to add keywords from relevant docs and to suppress non-relevant docs.
• Repeat until no more relevant docs can be found.
• Good when prevalence is low, but is it robust?
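The second (CAL-like) strategy above can be sketched as a loop. This is a toy simulation, not the method used in the challenge: the +1/-1 weight update is a crude stand-in for a human adjusting the query, the stopping rule (stop after a batch with no relevant docs) is one possible reading of the last step, and the documents, relevance labels, and seed query are all invented.

```python
def score(doc, weights):
    """Sum the weights of the distinct words in the doc (unknown words count as 0)."""
    return sum(weights.get(w, 0.0) for w in set(doc.split()))


def cal_style_search(docs, relevant_ids, seed_weights, batch_size=2):
    """Review top-scoring docs in batches, adjusting keyword weights after each review."""
    weights = dict(seed_weights)
    reviewed, found = set(), []
    while len(reviewed) < len(docs):
        unreviewed = [i for i in range(len(docs)) if i not in reviewed]
        # Review the top docs from the current query.
        batch = sorted(unreviewed, key=lambda i: score(docs[i], weights),
                       reverse=True)[:batch_size]
        hits = 0
        for i in batch:
            reviewed.add(i)
            is_hit = i in relevant_ids
            if is_hit:
                found.append(i)
                hits += 1
            # Assumed update: boost words from relevant docs, suppress the rest.
            delta = 1.0 if is_hit else -1.0
            for w in set(docs[i].split()):
                weights[w] = weights.get(w, 0.0) + delta
        if hits == 0:  # no relevant docs in this batch: stop
            break
    return found, reviewed, weights


docs = [  # hypothetical toy collection
    "gene protein cell", "protein antibody mouse", "antibody dna sequence",
    "cell phone battery", "stock market report", "computer software tests",
    "dna sequence genome", "market software phone", "battery report stock",
    "tests market battery",
]
relevant_ids = {0, 1, 2, 6}
found, reviewed, weights = cal_style_search(docs, relevant_ids, {"gene": 1.0})
print(sorted(found), len(reviewed))
```

In this toy run the single seed word "gene" is enough to reach all four relevant docs after reviewing 8 of the 10, because each relevant doc shares a word with one already found. Whether that chaining holds on real data is the robustness question raised in the last bullet.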
TAR 2.0 Robust? – Weak Seed
TAR 2.0 Robust? – Wrong Seed
TAR 2.0 Robust? – Disjoint Relevance
Toy Example Illustrating Workflows
TAR 1.0
TAR 2.0
TAR 3.0
Review Effort (All Candidates Reviewed)
Review Effort (No Candidates Reviewed)
Beyond Keywords
• Use meta-data.
• Feature engineering.
• Adjacent word pairs instead of single words.
• Non-linear relevance boundary.
• Transformations to handle synonyms, etc. (LSA, word2vec, etc.)
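As a small illustration of the "adjacent word pairs" idea, the sketch below turns text into word-pair features. The function name and the whitespace tokenization are simplifying assumptions; real systems may tokenize differently.

```python
def word_pair_features(text):
    """Return adjacent word pairs (bigrams) from whitespace-tokenized, lowercased text."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


pairs = word_pair_features("The cell phone battery died")
print(pairs)
```

A pair such as "cell phone" can be given its own weight, which separates it from the biological sense of "cell" without excluding every doc that contains "phone".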
Tips
• Think of TAR as a more systematic way to do keywords, plus more.
• Beware of keyword search culling before applying TAR – many relevant docs probably lost.
• Use the right performance metrics.
• Choose the right TAR workflow for the situation.
References
Misleading Metrics and Irrelevant Research (Accuracy and F1)
https://blog.cluster-text.com/2018/12/12/misleading-metrics-and-irrelevant-research-accuracy-and-f1/
The Single Seed Hypothesis
https://blog.cluster-text.com/2015/04/25/the-single-seed-hypothesis/
TAR 3.0 Performance
https://blog.cluster-text.com/2016/01/28/tar-3-0-performance/
Thank you for joining us!


Editor's Notes

  1. Cost = review effort if we will review all docs that will potentially be produced. Precision does not account for review of training docs or control set (when doing TAR).
  2. We say “cost” instead of precision here because cost should take training docs and the control set into account. Do NOT compare systems that differ in both cost and defensibility: no conclusion can be reached unless the same system wins on both. How much cost should you be willing to trade for more defensibility? That depends on circumstances, so there is no good answer. If none of the methods achieves recall that is adequate for e-discovery, the results aren’t relevant. At R=75%, 1-NN has P=6.6% and 40-NN has P=70.4%. 1-NN requires review of 15.2 docs per relevant doc found, whereas 40-NN requires only 1.4. 1-NN requires over 10x as much review.
  3. Mixes precision and recall, which measure very different things.
  4. This query is from using TAR 3.0 for training to find biology documents with SVM. Hundreds of words with positive weights – will miss very little (good for high recall). Precisely tuned positive and negative weights. Takes word correlation into account (some algorithms don’t). Can make use of broad words like “scientists” by adding words like “physics” with negative weight. Relies more on sorting of docs than on trying to pick the right subset. Doesn’t do something like “cell AND NOT (phone OR fuel OR solar)”, which could lose some relevant docs. Instead, use negative weights to push down in the sorted listing without losing. Based on actual data, not a guess.
  5. SVM: the slope of the boundary line determines the word weights. Explain margin.
  6. Shorter bars are better. Review effort includes training, control set, and review of docs predicted to be relevant (to achieve 75% recall). Tasks are ordered by descending prevalence (6.9% down to 0.3%).
  7. Meta-data: sender/recipient can be critical when looking for privileged docs. Feature engineering: Sender with only first name is probably spam.
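
The review-effort figures in note 2 follow from 1/P. A quick check, using only the two precision values quoted in that note (the variable names are illustrative):

```python
# Precision at 75% recall, as quoted in note 2.
precision_at_r75 = {"1-NN": 0.066, "40-NN": 0.704}

# 1/P = average docs reviewed per relevant doc found.
docs_per_relevant = {name: 1 / p for name, p in precision_at_r75.items()}
for name, n in docs_per_relevant.items():
    print(f"{name}: {n:.1f} docs reviewed per relevant doc found")

# How much more review 1-NN needs than 40-NN at the same recall.
ratio = docs_per_relevant["1-NN"] / docs_per_relevant["40-NN"]
print(f"ratio: {ratio:.1f}x")
```

This gives about 15.2 and 1.4 docs per relevant doc, a ratio of roughly 10.7, consistent with "over 10x as much review".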