Evaluating Text Extraction at Scale: A case study from Apache Tika
Tim Allison
Evaluating text extraction at scale.
1.
Evaluating Text Extraction At Scale: A Case Study from Apache Tika
Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative Development Organization (1740) ITSD
The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program.
© 2020 California Institute of Technology. Government sponsorship acknowledged.
Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
2.
About me
• Data scientist (files and search), Jet Propulsion Laboratory, California Institute of Technology
• Chair/V.P., Apache Tika
• Committer, Apache PDFBox, POI, Lucene/Solr, OpenNLP
• Member, Apache Software Foundation
jpl.nasa.gov
10/20/20
3.
Outline
• Apache Tika, an overview
• What can possibly go wrong?
• tika-eval module
• Evaluation at scale and public corpora
4.
Apache Tika, an overview
bytes → text and metadata
Framework for file type detection, parsing and uniform output for ~75 parsers, ~100+ formats
https://tika.apache.org/
5.
Apache Tika, features
• Easy to add new file types for detection
• Easy to add new parsers
• Works recursively with embedded files/attachments
• Integration with tesseract-ocr
6.
Embedded files/attachments
7.
Embedded files extracted
[
  {
    "Application-Name": "Microsoft Office Word",
    "Content-Length": "27082",
    "Content-Type": "application/....wordprocessingml.document",
    "X-TIKA:content": "embed_0 ",
    ...
  },
  {
    "Content-Type": "text/plain; charset=ISO-8859-1",
    "Last-Modified": "2014-06-04T04:08:28Z",
    "X-TIKA:content": "embed_1a",
    "X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt",
    ...
  },
  {
    "Content-Type": "application/zip",
    "Last-Modified": "2014-06-04T04:09:40Z",
    "X-TIKA:content": "embed4.txt",
    "X-TIKA:embedded_resource_path": "/embed1.zip/embed2.zip/embed3.zip/embed4.zip",
    ...
  },
  ...
]
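The output on this slide is a JSON list with one metadata object per file: the container document first, then each embedded file with its path inside the container. A minimal Python sketch of walking such a list is given here. The sample string is a trimmed stand-in built only from keys shown on the slide, and the function name `summarize_extracts` is hypothetical, not part of Tika.

```python
import json

# Trimmed stand-in for the recursive JSON output shown on the slide.
# The elided Content-Type of the first entry is kept exactly as elided.
SAMPLE = """
[
  {"Content-Type": "application/....wordprocessingml.document",
   "X-TIKA:content": "embed_0 "},
  {"Content-Type": "text/plain; charset=ISO-8859-1",
   "X-TIKA:content": "embed_1a",
   "X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt"},
  {"Content-Type": "application/zip",
   "X-TIKA:content": "embed4.txt",
   "X-TIKA:embedded_resource_path": "/embed1.zip/embed2.zip/embed3.zip/embed4.zip"}
]
"""


def summarize_extracts(json_text):
    """Return (path, content type, nesting depth, text length) per file."""
    rows = []
    for meta in json.loads(json_text):
        # The container entry has no embedded_resource_path; use "/" for it.
        path = meta.get("X-TIKA:embedded_resource_path", "/")
        # Depth = number of path segments (0 for the container itself).
        depth = len([part for part in path.split("/") if part])
        text = meta.get("X-TIKA:content") or ""
        rows.append((path, meta.get("Content-Type"), depth, len(text.strip())))
    return rows


if __name__ == "__main__":
    for row in summarize_extracts(SAMPLE):
        print(row)
```

A per-file summary like this makes missing attachments visible: an attachment that is dropped shows up as a missing row rather than as an exception.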
8.
From PDF to …
9.
Metadata
10.
Text
11.
Metadata and Content in JSON
12.
What could possibly go wrong?
• Basic
  • Thrown exceptions: corrupt files, bad parser?
  • Image only: need a mechanism to determine when to run OCR
  • Password-protected files
• Catastrophic
  • Infinite loops/out of memory/seg fault
  • Security compromises
13.
What could subtly go wrong?
• Missing text
• Missing attachments
• Garbled text
14.
Missing Text
File publicly available: https://issues.apache.org/jira/browse/TIKA-1130
15.
Garbled Text
Upgrade from Apache PDFBox 1.8.6 to 1.8.7
16.
tika-eval
• Library and command-line tool to calculate these stats
• Profile a single batch of text extracts
• Compare two batches of extracts
  • Different tools
  • Different settings
• Coming soon: multi-compare
https://cwiki.apache.org/confluence/display/TIKA/TikaEval
17.
Out of vocabulary (OOV): Same file, different Tika

Metric             Tika 1.14           Tika 1.15-SNAPSHOT
Unique Tokens      786                 156
Total Tokens       1603                272
LangId             zh-ch               de
Common Words       0                   116
Alphabetic Tokens  1603                250
OOV%               1-(0/1603) = 100%   1-(116/250) = 54%

Top N Tokens, Tika 1.14: 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵 捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9
Top N Tokens, Tika 1.15-SNAPSHOT: die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5

Overlap: 0%
Increase in Common Words: 116
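The OOV% row on this slide follows the formula 1 - (common words / alphabetic tokens). A minimal Python sketch of that arithmetic is given here. How tika-eval tokenizes text and which common-word lists it uses are not shown on the slide, so the `profile` helper, its whitespace-free token input, and the `common` set are hypothetical stand-ins.

```python
def oov_percent(common_words, alphabetic_tokens):
    """OOV% = 1 - (common words / alphabetic tokens), as a percentage."""
    if alphabetic_tokens == 0:
        # Assumption: treat OOV as undefined when there are no alphabetic tokens.
        return None
    return 100.0 * (1.0 - common_words / alphabetic_tokens)


def profile(tokens, common):
    """Count tokens the way the slide's table does (simplified assumption)."""
    alphabetic = [t for t in tokens if t.isalpha()]
    common_count = sum(1 for t in alphabetic if t.lower() in common)
    return {
        "total": len(tokens),
        "alphabetic": len(alphabetic),
        "common": common_count,
        "oov_percent": oov_percent(common_count, len(alphabetic)),
    }


if __name__ == "__main__":
    # Figures from the slide: Tika 1.14 vs. Tika 1.15-SNAPSHOT.
    print(round(oov_percent(0, 1603)))   # 100
    print(round(oov_percent(116, 250)))  # 54
```

A drop from 100% to 54% on the same file is the signal of interest here: the newer extract contains words that a German word list recognizes, while the older one contains none.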
18.
Out of vocabulary (OOV): Same file, different Tika

Metric             Tika 1.14               Tika 1.15-SNAPSHOT
Unique Tokens      1916                    1995
Total Tokens       14187                   14302
LangId             en                      en
Common Words       7498                    7409
Alphabetic Tokens  13472                   13587
OOV%               1-(7498/13472) = 44%    1-(7409/13587) = 45%

Top 10 Unique Tokens, Tika 1.14: applicant's: 8 | 1.69: 1 | arbitrary: 1 | collecting: 1 | constitution: 1 | e112: 1 | ei.b: 1 | equating: 1 | magnetically: 1 | o: 1
Top 10 Unique Tokens, Tika 1.15-SNAPSHOT: ss: 106 | applicantis: 8 | ssss: 7 | iactsi: 4 | ithe: 4 | imeansi: 3 | iprocessi: 3 | calculations.i: 2 | iabstract: 2 | idata: 2

Overlap: 95.5%
Increase in Common Words: -89
19.
OOV/Common Tokens corpus wide
Comparing the HTMLDefault encoding detector with Mozilla’s Universal Encoding Detector on HTML encoding detection

HTMLDefault  Universal     HTMLDefault Sum Common Tokens  Universal Sum Common Tokens  Difference in Sums
UTF-8        EUC-JP        4,437                          481,919                      477,482
EUC-JP       Shift_JIS     1,512                          391,126                      389,614
UTF-16       windows-1252  1,240                          368,496                      367,256
UTF-16       UTF-8         2,563                          321,717                      319,154
EUC-JP       UTF-8         764,957                        1,047,029                    282,072
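The table on this slide groups files by the pair of encodings the two detectors chose, sums the common tokens found under each choice, and lists the pairs by the difference in sums. A minimal Python sketch of that ranking step is given here, using the five rows from the slide. The function name `rank_disagreements` and the tuple layout are hypothetical, not tika-eval's own output format.

```python
# (HTMLDefault encoding, Universal encoding,
#  HTMLDefault sum of common tokens, Universal sum of common tokens)
ROWS = [
    ("UTF-16", "UTF-8", 2563, 321717),
    ("EUC-JP", "UTF-8", 764957, 1047029),
    ("UTF-8", "EUC-JP", 4437, 481919),
    ("UTF-16", "windows-1252", 1240, 368496),
    ("EUC-JP", "Shift_JIS", 1512, 391126),
]


def rank_disagreements(rows):
    """Append the difference in sums and sort by it, largest first."""
    ranked = [(a, b, sum_a, sum_b, sum_b - sum_a) for a, b, sum_a, sum_b in rows]
    ranked.sort(key=lambda row: row[4], reverse=True)
    return ranked


if __name__ == "__main__":
    for default_enc, universal_enc, _, _, diff in rank_disagreements(ROWS):
        print(f"{default_enc:>8} vs {universal_enc:<13} {diff:>9,}")
```

A large positive difference suggests that, for that group of files, the second detector's choice produced far more recognizable words, which points a developer at the encoding pairs worth inspecting first.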
20.
Scaling OOV in a set of English PDFs
21.
From analytics to action
https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf
22.
Stored text vs. Optical Character Recognition

Text As Stored in File
!"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc 5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a

Text from Tesseract OCR
Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest Descent Method Nobuhiko Ogura’ and Isao Yamada” 1 Introduction A closed polyhedron is the intersection of finite number of closed half spaces, i.e., the setof points satisfying finite number of lincar incqualitics, and is widely used as a constraint in various application, for example specifications or constraints in signal processing or estimation problems, resource restrictions in financial applications and feasible sets of
23.
What would Google do?
Text in Google’s Cache
Popat, Ashok. (2009). A panlingual anomalous text detector. 201-204. 10.1145/1600193.1600237.
24.
Uses
• Parser developers
• Production: search, Natural Language Processing, etc…
  • Which parsers/extract to use?
  • Which encoding detector to trust?
  • When to run OCR?
  • Do not index/relevance weights
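The production questions on this slide (which extract to use, when to run OCR) can be framed as simple rules over the common-word and alphabetic-token counts. A minimal Python sketch under stated assumptions is given here: the function names and the 0.9 OOV cutoff are hypothetical, chosen only for illustration, and the deck's own limitations slides caution that more common words is not always better.

```python
def choose_extract(candidates):
    """Pick the candidate with the most common words.

    candidates maps a label to (common_words, alphabetic_tokens).
    """
    return max(candidates, key=lambda name: candidates[name][0])


def should_try_ocr(common_words, alphabetic_tokens, max_oov=0.9):
    """Suggest OCR when stored text is empty or almost entirely out of vocabulary.

    max_oov is a hypothetical threshold, not a tika-eval default.
    """
    if alphabetic_tokens == 0:
        # No stored text at all: possibly an image-only file.
        return True
    oov = 1.0 - common_words / alphabetic_tokens
    return oov > max_oov


if __name__ == "__main__":
    extracts = {
        "Tika 1.14": (0, 1603),
        "Tika 1.15-SNAPSHOT": (116, 250),
    }
    print(choose_extract(extracts))
    print(should_try_ocr(0, 1603))
    print(should_try_ocr(7498, 13472))
```

Rules like these are best treated as triage: they flag files for a closer look or a second extraction path, and the same scores could feed a do-not-index decision or a relevance weight.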
25.
Other indicators
• Low token/page ratio
• Language identification: Catalan in a largely English corpus?!
• Median/mean word length and stdev
• Number of Unicode code blocks
• Alphabetic/non-alphabetic
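Several of the indicators listed on this slide can be computed from the extracted text alone. A minimal Python sketch is given here, assuming whitespace tokenization and a page count supplied by the caller. Python's standard library does not expose Unicode blocks, so the sketch counts the first word of each letter's Unicode character name (for example LATIN or CJK) as a rough proxy; that proxy, the function name `indicators`, and the dictionary keys are all assumptions, and language identification is left out.

```python
import statistics
import unicodedata


def indicators(text, page_count):
    """Compute simple junk-text indicators for one extract."""
    tokens = text.split()  # assumption: whitespace tokenization
    lengths = [len(t) for t in tokens]
    alphabetic = [t for t in tokens if t.isalpha()]
    # Rough proxy for "number of Unicode code blocks": the first word of
    # each letter's Unicode name, e.g. LATIN, CJK, CYRILLIC.
    scripts = {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }
    return {
        "tokens_per_page": len(tokens) / page_count if page_count else None,
        "mean_len": statistics.mean(lengths) if lengths else None,
        "median_len": statistics.median(lengths) if lengths else None,
        "stdev_len": statistics.pstdev(lengths) if lengths else None,
        "alphabetic_ratio": len(alphabetic) / len(tokens) if tokens else None,
        "script_count": len(scripts),
    }


if __name__ == "__main__":
    print(indicators("die und von deutschen 1 das", page_count=1))
```

None of these numbers is decisive alone; an unusually low tokens-per-page value or an unexpected mix of scripts is a prompt to inspect the file, not a verdict.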
26.
Limitations in absence of ground truth
• More exceptions: We have a problem! Wait…
  • New parser, we were entirely skipping those file types before
  • Parser was yielding junk before on this file, now it is letting us know there’s a problem
• Fewer exceptions: Great! Wait…
  • Mime detection not working: skipping files that we used to parse (theoretical)
  • Now we’re getting junk
27.
Limitations in absence of ground truth
• More Common Words: Great! Wait…
  • Serious bug that duplicates worksheets in some xlsx files (TIKA-2356…my fault…ugh!)
• Fewer Common Words: Problem! Wait…
• More attachments, fewer attachments (Your turn!)
28.
Multi-compare example…in the works
file1
• Set 0: 0 bytes
• Set 1: HÕll World!
• Set 2: H’ll World!
• Set 3: HÕllÿ World!
29.
Taking tika-eval public
• Maruan Sahyoun kindly hosts a VM for ongoing evals (TIKA-1302)
• 1 TB (~3 million files) from govdocs1, Common Crawl and bug trackers
• Collaborating with Apache PDFBox and Apache POI to run evals as part of the release process
30.
Taking tika-eval public
• Critical to identifying regressions and building new parsers
• Stacktraces created by public documents are critical for the hey-I’m-getting-this-parse-exception-but-can’t-share-the-document-with-you problem
31.
Corpora
• Govdocs1 (https://digitalcorpora.org/corpora/files)
• Common Crawl (https://commoncrawl.org/)
• Bug trackers (see Peter Wyatt’s article: https://www.pdfa.org/a-new-stressful-pdf-corpus/)
• Coming soon(?): Unit-test files from GitHub parser projects
32.
Corpora
https://corpora.tika.apache.org/base/docs/
33.
Bug-trackers
https://corpora.tika.apache.org/base/docs/bug_trackers/
34.
Please join the fun!
• https://tika.apache.org/
• https://cwiki.apache.org/confluence/display/TIKA/TikaEval
• corpora-dev@tika.apache.org
• https://issues.apache.org/jira/projects/TIKA
• @ApacheTika
35.
Questions?
timothy.b.allison@jpl.nasa.gov
@_tallison