SlideShare a Scribd company logo
1 of 36
Download to read offline
Evaluating Text
Extraction At Scale: A Case Study
from Apache Tika
Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative
Development Organization(1740)
ITSD
The research was carried out at the NASA (National Aeronautics
and Space Administration) Jet Propulsion Laboratory, California
Institute of Technology under a contract with the Defense
Advanced Research Projects Agency (DARPA) SafeDocs
program. © 2020 California Institute of Technology. Government
sponsorship acknowledged.
Reference herein to any specific commercial product, process,
or service by trade name, trademark, manufacturer, or
otherwise, does not constitute or imply its endorsement by the
United States Government or the Jet Propulsion Laboratory,
California Institute of Technology.
jpl.nasa.gov
About me
• Data scientist (files and search) Jet Propulsion
Laboratory, California Institute of Technology
• Chair/V.P. Apache Tika
• Committer Apache PDFBox, POI, Lucene/Solr,
OpenNLP
• Member Apache Software Foundation
2© 2020 California Institute of Technology. Government sponsorship acknowledged.10/20/20
The research was carried out at the NASA (National Aeronautics and Space Administration)
Jet Propulsion Laboratory, California Institute of Technology under a contract with the
Defense Advanced Research Projects Agency (DARPA) SafeDocs program.
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Outline
• Apache Tika, an overview
• What can possibly go wrong?
• tika-eval module
• Evaluation at scale and public corpora
10/20/20 3© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Apache Tika, an overview
10/20/20 4
bytes
text and metadata
Framework for file
type detection,
parsing and uniform
output for ~75
parsers, ~100+
formats
https://tika.apache.org/
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Apache Tika, features
• Easy to add new file types for detection
• Easy to add new parsers
• Works recursively with embedded files/attachments
• Integration with tesseract-ocr
10/20/20 5© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Embedded files/attachments
10/20/20 6© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Embedded files extracted
10/20/20 7© 2020 California Institute of Technology. Government sponsorship acknowledged.
[ {
"Application-Name": "Microsoft Office Word",
"Content-Length": "27082",
"Content-Type": "application/....wordprocessingml.document",
"X-TIKA:content": "embed_0 ",
... },
{ "Content-Type": "text/plain; charset=ISO-8859-1",
"Last-Modified": "2014-06-04T04:08:28Z",
"X-TIKA:content": "embed_1a",
"X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt",
... },
{
"Content-Type": "application/zip",
"Last-Modified": "2014-06-04T04:09:40Z",
"X-TIKA:content": "embed4.txt",
"X-TIKA:embedded_resource_path":
"/embed1.zip/embed2.zip/embed3.zip/embed4.zip"
...}, ...]
jpl.nasa.gov
From PDF to …
10/20/20 8© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Metadata
10/20/20 9© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Text
10/20/20 10© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Metadata and Content in JSON
10/20/20 11© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
What could possibly go wrong?
• Basic
• Thrown Exceptions – corrupt files, bad parser?
• Image only – need to have mechanism to determine
when to run OCR
• Password protected files
• Catastrophic
• Infinite loops/out of memory/seg fault
• Security compromises
10/20/20 12© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
What could subtly go wrong?
• Missing text
• Missing attachments
• Garbled text
10/20/20 13© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Missing Text
10/20/20 14
File publicly available: https://issues.apache.org/jira/browse/TIKA-1130
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Garbled Text
10/20/20 15
Upgrade from
Apache PDFBox
1.8.6 to 1.8.7
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
tika-eval
10/20/20 16
• Library and commandline tool to calculate these
stats
• Profile a single batch of text extracts
• Compare two batches of extracts
• Different tools
• Different settings
• Coming soon – multi-compare
https://cwiki.apache.org/confluence/display/TIKA/TikaEval
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Out of vocabulary (OOV) – Same file, different Tika
10/20/20 17
Tika 1.14 Tika 1.15-SNAPSHOT
Unique Tokens 786 156
Total Tokens 1603 272
LangId zh-ch de
Common Words 0 116
Alphabetic Tokens 1603 250
Top N Tokens 捳敨: 18 | 獴档: 14 | 略獴:
14 | m: 11 | 杮湥: 11 | 瑵
捳: 11 | 畬杮: 11 | 档湥:
10 | 搠敩: 9 | 敮浨: 9
die: 11 | und: 8 | von: 8 |
deutschen: 7 | deutsche: 6 |
1: 5 | das: 5 | der: 5 |
finanzministerium: 5 | oder: 5
OOV% 1-(0/1603) = 100% 1-(116/250) = 54%
© 2020 California Institute of Technology. Government sponsorship acknowledged.
Overlap: 0%
Increase in Common Words: 116
jpl.nasa.gov
Out of vocabulary (OOV) – Same file, different Tika
10/20/20 18
Tika 1.14 Tika 1.15-SNAPSHOT
Unique Tokens 1916 1995
Total Tokens 14187 14302
LangId en en
Common Words 7498 7409
Alphabetic Tokens 13472 13587
Top 10 Unique
Tokens
applicant's: 8 | 1.69: 1 | arbitrary:
1 | collecting: 1 | constitution: 1 |
e112: 1 | ei.b: 1 | equating: 1 |
magnetically: 1 | o: 1
ss: 106 | applicantis: 8 | ssss: 7 |
iactsi: 4 | ithe: 4 | imeansi: 3 |
iprocessi: 3 | calculations.i: 2 |
iabstract: 2 | idata: 2
OOV% 1-(7498/13472) = 44% 1-(7409/13587) = 45%
© 2020 California Institute of Technology. Government sponsorship acknowledged.
Overlap: 95.5%
Increase in Common Words: -89
jpl.nasa.gov
OOV/Common Tokens corpus wide
Comparing the HTMLDefault encoding detector with Mozilla’s Universal
Encoding Detector on HTML encoding detection
10/20/20 19
HTMLDefault Universal HTMLDefault Sum
Common Tokens
Universal Sum
Common Tokens
Difference
in Sums
UTF-8 EUC-JP 4,437 481,919 477,482
EUC-JP Shift_JIS 1,512 391,126 389,614
UTF-16 windows-1252 1,240 368,496 367,256
UTF-16 UTF-8 2,563 321,717 319,154
EUC-JP UTF-8 764,957 1,047,029 282,072
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Scaling OOV in a set of English PDFs
10/20/20 20© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
From analytics to action
10/20/20 21
https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Stored text vs. Optical Character Recognition
10/20/20 22
Text As Stored in File
!"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB
D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc
5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a
lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a
Text from Tesseract OCR
Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest
Descent Method
Nobuhiko Ogura’ and Isao Yamada”
1 Introduction
A closed polyhedron is the intersection of finite number of closed half
spaces, i.e., the setof points satisfying finite number of lincar
incqualitics, and is widely used as a constraint in various application, for
example specifications or constraints in signal processing or estimation
problems, resource restrictions in financial applications and feasible sets of
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
What would Google do?
10/20/20 23
Text in Google’s Cache
Popat, Ashok. (2009). A panlingual anomalous text detector. 201-204.
10.1145/1600193.1600237.
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Uses
• Parser developers
• Production – search, Natural Language Processing,
etc…
• Which parsers/extract to use?
• Which encoding detector to trust?
• When to run OCR?
• Do not index/relevance weights
10/20/20 24© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Other indicators
• Low token/page ratio
• Language identification – Catalan in a largely
English corpus?!
• Median/mean word length and stdev
• Number of Unicode code blocks
• Alphabetic/non-alphabetic
10/20/20 25© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Limitations in absence of ground truth
• More exceptions – We have a problem! Wait…
• New parser, we were entirely skipping those file types
before
• Parser was yielding junk before on this file, now it is
letting us know there’s a problem
• Fewer exceptions – Great! Wait…
• Mime detection not working – skipping files that we
used to parse (theoretical)
• Now we’re getting junk
10/20/20 26© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Limitations in absence of ground truth
• More Common Words – Great! Wait…
• Serious bug that duplicates worksheets in some xlsx
files (TIKA-2356…my fault…ugh!)
• Fewer Common Words – Problem! Wait…
• More attachments, fewer attachments (Your turn!)
10/20/20 27© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Multi-compare example…in the works
10/20/20 28
• Set 0: 0 bytes
• Set 1: HÕll World!
• Set 2: H’ll World!
• Set 3: HÕllÿ World!
file1
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Taking tika-eval public
• Maruan Sahyoun kindly hosts a vm for ongoing
evals (TIKA-1302)
• 1 TB (~3 million files) from govdocs1, Common
Crawl and bug trackers
• Collaborating with Apache PDFBox and Apache POI
to run evals as part of the release process
10/20/20 29© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Taking tika-eval public
• Critical to identifying regressions and building new
parsers
• Stacktraces created by public documents are critical
for the hey-I’m-getting-this parse-
exception-but-can’t-share-the-
document-with-you problem
10/20/20 30© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Corpora
• Govdocs1 (https://digitalcorpora.org/corpora/files)
• Common Crawl (https://commoncrawl.org/)
• Bug trackers (see Peter Wyatt’s article:
https://www.pdfa.org/a-new-stressful-pdf-corpus/)
• Coming soon(?): Unit-test files from github parser
projects
10/20/20 31© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Corpora
10/20/20 32
https://corpora.tika.apache.org/base/docs/
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Bug-trackers
10/20/20 33
https://corpora.tika.apache.org/base/docs/bug_trackers/
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
Please join the fun!
• https://tika.apache.org/
• https://cwiki.apache.org/confluence/display/TIKA/Tik
aEval
• corpora-dev@tika.apache.org
• https://issues.apache.org/jira/projects/TIKA
• @ApacheTika
10/20/20 34© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov10/20/20 35
Questions?
timothy.b.allison@jpl.nasa.gov
@_tallison
© 2020 California Institute of Technology. Government sponsorship acknowledged.
jpl.nasa.gov
10/20/20 36© 2020 California Institute of Technology. Government sponsorship acknowledged.

More Related Content

Similar to Evaluating Text Extraction at Scale: A case study from Apache Tika

Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Tim Allison
 
Embedded Files: Risks, Challenges and Options
Embedded Files: Risks, Challenges and OptionsEmbedded Files: Risks, Challenges and Options
Embedded Files: Risks, Challenges and OptionsTim Allison
 
"Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild""Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild"Tim Allison
 
Visualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeVisualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeJonathan Yu
 
Australian Open government and research data pilot survey 2017
Australian Open government and research data pilot survey 2017Australian Open government and research data pilot survey 2017
Australian Open government and research data pilot survey 2017Jonathan Yu
 
The State of Linked Government Data
The State of Linked Government DataThe State of Linked Government Data
The State of Linked Government DataRichard Cyganiak
 
A novel programmable attenuator based low Gm-OTA for biomedical applications
A novel programmable attenuator based low Gm-OTA for biomedical applicationsA novel programmable attenuator based low Gm-OTA for biomedical applications
A novel programmable attenuator based low Gm-OTA for biomedical applicationsHoopeer Hoopeer
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
Conrad PDES Spring 2006
Conrad PDES Spring 2006Conrad PDES Spring 2006
Conrad PDES Spring 2006Mark Conrad
 
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...OpenAIRE
 
OpenAIRE and the Case of Irish Repositories
OpenAIRE and the Case of Irish RepositoriesOpenAIRE and the Case of Irish Repositories
OpenAIRE and the Case of Irish RepositoriesRIANIreland
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionFlink Forward
 
RDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
RDAP13 Lorrie Johnson: Facilitating Access to Scientific DataRDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
RDAP13 Lorrie Johnson: Facilitating Access to Scientific DataASIS&T
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web APISammy Fung
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudDhaval Thakker
 
Patent scope ompi
Patent scope ompiPatent scope ompi
Patent scope ompiLATIPAT
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSAPRBETTER
 
dcat: An RDF vocabulary for interoperability of data catalogues
dcat: An RDF vocabulary for interoperability of data cataloguesdcat: An RDF vocabulary for interoperability of data catalogues
dcat: An RDF vocabulary for interoperability of data cataloguesRichard Cyganiak
 

Similar to Evaluating Text Extraction at Scale: A case study from Apache Tika (20)

Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2Haystack Live tallison_202010_v2
Haystack Live tallison_202010_v2
 
Embedded Files: Risks, Challenges and Options
Embedded Files: Risks, Challenges and OptionsEmbedded Files: Risks, Challenges and Options
Embedded Files: Risks, Challenges and Options
 
"Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild""Building a File Observatory: Making Sense of PDFs in the Wild"
"Building a File Observatory: Making Sense of PDFs in the Wild"
 
Visualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscapeVisualising the Australian open data and research data landscape
Visualising the Australian open data and research data landscape
 
Australian Open government and research data pilot survey 2017
Australian Open government and research data pilot survey 2017Australian Open government and research data pilot survey 2017
Australian Open government and research data pilot survey 2017
 
The State of Linked Government Data
The State of Linked Government DataThe State of Linked Government Data
The State of Linked Government Data
 
A novel programmable attenuator based low Gm-OTA for biomedical applications
A novel programmable attenuator based low Gm-OTA for biomedical applicationsA novel programmable attenuator based low Gm-OTA for biomedical applications
A novel programmable attenuator based low Gm-OTA for biomedical applications
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Oscon 2011 wylie
Oscon 2011 wylieOscon 2011 wylie
Oscon 2011 wylie
 
Conrad PDES Spring 2006
Conrad PDES Spring 2006Conrad PDES Spring 2006
Conrad PDES Spring 2006
 
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...
OpenAIRE and the case of Irish Repositories, by Jochen Schirrwagen (RIAN Work...
 
OpenAIRE and the Case of Irish Repositories
OpenAIRE and the Case of Irish RepositoriesOpenAIRE and the Case of Irish Repositories
OpenAIRE and the Case of Irish Repositories
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
RDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
RDAP13 Lorrie Johnson: Facilitating Access to Scientific DataRDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
RDAP13 Lorrie Johnson: Facilitating Access to Scientific Data
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web API
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data Cloud
 
Patent scope ompi
Patent scope ompiPatent scope ompi
Patent scope ompi
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
 
DTL Partners Event - FAIR Data Tech overview - Day 1
DTL Partners Event - FAIR Data Tech overview - Day 1DTL Partners Event - FAIR Data Tech overview - Day 1
DTL Partners Event - FAIR Data Tech overview - Day 1
 
dcat: An RDF vocabulary for interoperability of data catalogues
dcat: An RDF vocabulary for interoperability of data cataloguesdcat: An RDF vocabulary for interoperability of data catalogues
dcat: An RDF vocabulary for interoperability of data catalogues
 

Recently uploaded

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Evaluating Text Extraction at Scale: A case study from Apache Tika

  • 1. Evaluating Text Extraction At Scale: A Case Study from Apache Tika Tim Allison, Ph.D. Data Scientist/Relevance Engineer Artificial Intelligence, Analytics and Innovative Development Organization(1740) ITSD The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. © 2020 California Institute of Technology. Government sponsorship acknowledged. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
  • 2. jpl.nasa.gov About me • Data scientist (files and search) Jet Propulsion Laboratory, California Institute of Technology • Chair/V.P. Apache Tika • Committer Apache PDFBox, POI, Lucene/Solr, OpenNLP • Member Apache Software Foundation 2© 2020 California Institute of Technology. Government sponsorship acknowledged.10/20/20 The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 3. jpl.nasa.gov Outline • Apache Tika, an overview • What can possibly go wrong? • tika-eval module • Evaluation at scale and public corpora 10/20/20 3© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 4. jpl.nasa.gov Apache Tika, an overview 10/20/20 4 bytes text and metadata Framework for file type detection, parsing and uniform output for ~75 parsers, ~100+ formats https://tika.apache.org/ © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 5. jpl.nasa.gov Apache Tika, features • Easy to add new file types for detection • Easy to add new parsers • Works recursively with embedded files/attachments • Integration with tesseract-ocr 10/20/20 5© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 6. jpl.nasa.gov Embedded files/attachments 10/20/20 6© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 7. jpl.nasa.gov Embedded files extracted 10/20/20 7© 2020 California Institute of Technology. Government sponsorship acknowledged. [ { "Application-Name": "Microsoft Office Word", "Content-Length": "27082", "Content-Type": "application/....wordprocessingml.document", "X-TIKA:content": "embed_0 ", ... }, { "Content-Type": "text/plain; charset=ISO-8859-1", "Last-Modified": "2014-06-04T04:08:28Z", "X-TIKA:content": "embed_1a", "X-TIKA:embedded_resource_path": "/embed1.zip/embed1a.txt", ... }, { "Content-Type": "application/zip", "Last-Modified": "2014-06-04T04:09:40Z", "X-TIKA:content": "embed4.txt", "X-TIKA:embedded_resource_path": "/embed1.zip/embed2.zip/embed3.zip/embed4.zip" ...}, ...]
  • 8. jpl.nasa.gov From PDF to … 10/20/20 8© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 9. jpl.nasa.gov Metadata 10/20/20 9© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 10. jpl.nasa.gov Text 10/20/20 10© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 11. jpl.nasa.gov Metadata and Content in JSON 10/20/20 11© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 12. jpl.nasa.gov What could possibly go wrong? • Basic • Thrown Exceptions – corrupt files, bad parser? • Image only – need to have mechanism to determine when to run OCR • Password protected files • Catastrophic • Infinite loops/out of memory/seg fault • Security compromises 10/20/20 12© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 13. jpl.nasa.gov What could subtly go wrong? • Missing text • Missing attachments • Garbled text 10/20/20 13© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 14. jpl.nasa.gov Missing Text 10/20/20 14 File publicly available: https://issues.apache.org/jira/browse/TIKA-1130 © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 15. jpl.nasa.gov Garbled Text 10/20/20 15 Upgrade from Apache PDFBox 1.8.6 to 1.8.7 © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 16. jpl.nasa.gov tika-eval 10/20/20 16 • Library and commandline tool to calculate these stats • Profile a single batch of text extracts • Compare two batches of extracts • Different tools • Different settings • Coming soon – multi-compare https://cwiki.apache.org/confluence/display/TIKA/TikaEval © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 17. jpl.nasa.gov Out of vocabulary (OOV) – Same file, different Tika 10/20/20 17 Tika 1.14 Tika 1.15-SNAPSHOT Unique Tokens 786 156 Total Tokens 1603 272 LangId zh-ch de Common Words 0 116 Alphabetic Tokens 1603 250 Top N Tokens 捳敨: 18 | 獴档: 14 | 略獴: 14 | m: 11 | 杮湥: 11 | 瑵 捳: 11 | 畬杮: 11 | 档湥: 10 | 搠敩: 9 | 敮浨: 9 die: 11 | und: 8 | von: 8 | deutschen: 7 | deutsche: 6 | 1: 5 | das: 5 | der: 5 | finanzministerium: 5 | oder: 5 OOV% 1-(0/1603) = 100% 1-(116/250) = 54% © 2020 California Institute of Technology. Government sponsorship acknowledged. Overlap: 0% Increase in Common Words: 116
  • 18. jpl.nasa.gov Out of vocabulary (OOV) – Same file, different Tika 10/20/20 18 Tika 1.14 Tika 1.15-SNAPSHOT Unique Tokens 1916 1995 Total Tokens 14187 14302 LangId en en Common Words 7498 7409 Alphabetic Tokens 13472 13587 Top 10 Unique Tokens applicant's: 8 | 1.69: 1 | arbitrary: 1 | collecting: 1 | constitution: 1 | e112: 1 | ei.b: 1 | equating: 1 | magnetically: 1 | o: 1 ss: 106 | applicantis: 8 | ssss: 7 | iactsi: 4 | ithe: 4 | imeansi: 3 | iprocessi: 3 | calculations.i: 2 | iabstract: 2 | idata: 2 OOV% 1-(7498/13472) = 44% 1-(7409/13587) = 45% © 2020 California Institute of Technology. Government sponsorship acknowledged. Overlap: 95.5% Increase in Common Words: -89
  • 19. jpl.nasa.gov OOV/Common Tokens corpus wide Comparing the HTMLDefault encoding detector with Mozilla’s Universal Encoding Detector on HTML encoding detection 10/20/20 19 HTMLDefault Universal HTMLDefault Sum Common Tokens Universal Sum Common Tokens Difference in Sums UTF-8 EUC-JP 4,437 481,919 477,482 EUC-JP Shift_JIS 1,512 391,126 389,614 UTF-16 windows-1252 1,240 368,496 367,256 UTF-16 UTF-8 2,563 321,717 319,154 EUC-JP UTF-8 764,957 1,047,029 282,072 © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 20. jpl.nasa.gov Scaling OOV in a set of English PDFs 10/20/20 20© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 21. jpl.nasa.gov From analytics to action 10/20/20 21 https://aviris.jpl.nasa.gov/proceedings/workshops/02_docs/2002_Ogura_1_web.pdf © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 22. jpl.nasa.gov Stored text vs. Optical Character Recognition 10/20/20 22 Text As Stored in File !"#%$& (') *,+-).' / 0 1,23 *. 457698;:;<>=75?&@78;ACB D(B7E;FHGJICBK5MLNBKOPBKF;B DJD Q R S.TVU9WNXMY[ZT]^W_S `badc 5KICedFgfh5 cji :;edF;A^5KEk<>Imln:;e[<>EnloedACICe a lo<p57Eg5Kqsr;E;<jloe[E 8;O 6hedA5Kq adc 57ItedFk:;B c qsICf;B a Text from Tesseract OCR Constrained Least Squares Linear Spectral Unmixture by the Hybrid Steepest Descent Method Nobuhiko Ogura’ and Isao Yamada” 1 Introduction A closed polyhedron is the intersection of finite number of closed half spaces, i.e., the setof points satisfying finite number of lincar incqualitics, and is widely used as a constraint in various application, for example specifications or constraints in signal processing or estimation problems, resource restrictions in financial applications and feasible sets of © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 23. jpl.nasa.gov What would Google do? 10/20/20 23 Text in Google’s Cache Popat, Ashok. (2009). A panlingual anomalous text detector. 201-204. 10.1145/1600193.1600237. © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 24. jpl.nasa.gov Uses • Parser developers • Production – search, Natural Language Processing, etc… • Which parsers/extract to use? • Which encoding detector to trust? • When to run OCR? • Do not index/relevance weights 10/20/20 24© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 25. jpl.nasa.gov Other indicators • Low token/page ratio • Language identification – Catalan in a largely English corpus?! • Median/mean word length and stdev • Number of Unicode code blocks • Alphabetic/non-alphabetic 10/20/20 25© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 26. jpl.nasa.gov Limitations in absence of ground truth • More exceptions – We have a problem! Wait… • New parser, we were entirely skipping those file types before • Parser was yielding junk before on this file, now it is letting us know there’s a problem • Fewer exceptions – Great! Wait… • Mime detection not working – skipping files that we used to parse (theoretical) • Now we’re getting junk 10/20/20 26© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 27. jpl.nasa.gov Limitations in absence of ground truth • More Common Words – Great! Wait… • Serious bug that duplicates worksheets in some xlsx files (TIKA-2356…my fault…ugh!) • Fewer Common Words – Problem! Wait… • More attachments, fewer attachments (Your turn!) 10/20/20 27© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 28. jpl.nasa.gov Multi-compare example…in the works 10/20/20 28 • Set 0: 0 bytes • Set 1: HÕll World! • Set 2: H’ll World! • Set 3: HÕllÿ World! file1 © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 29. jpl.nasa.gov Taking tika-eval public • Maruan Sahyoun kindly hosts a vm for ongoing evals (TIKA-1302) • 1 TB (~3 million files) from govdocs1, Common Crawl and bug trackers • Collaborating with Apache PDFBox and Apache POI to run evals as part of the release process 10/20/20 29© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 30. jpl.nasa.gov Taking tika-eval public • Critical to identifying regressions and building new parsers • Stacktraces created by public documents are critical for the hey-I’m-getting-this parse- exception-but-can’t-share-the- document-with-you problem 10/20/20 30© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 31. jpl.nasa.gov Corpora • Govdocs1 (https://digitalcorpora.org/corpora/files) • Common Crawl (https://commoncrawl.org/) • Bug trackers (see Peter Wyatt’s article: https://www.pdfa.org/a-new-stressful-pdf-corpus/) • Coming soon(?): Unit-test files from github parser projects 10/20/20 31© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 32. jpl.nasa.gov Corpora 10/20/20 32 https://corpora.tika.apache.org/base/docs/ © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 33. jpl.nasa.gov Bug-trackers 10/20/20 33 https://corpora.tika.apache.org/base/docs/bug_trackers/ © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 34. jpl.nasa.gov Please join the fun! • https://tika.apache.org/ • https://cwiki.apache.org/confluence/display/TIKA/Tik aEval • corpora-dev@tika.apache.org • https://issues.apache.org/jira/projects/TIKA • @ApacheTika 10/20/20 34© 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 35. jpl.nasa.gov10/20/20 35 Questions? timothy.b.allison@jpl.nasa.gov @_tallison © 2020 California Institute of Technology. Government sponsorship acknowledged.
  • 36. jpl.nasa.gov 10/20/20 36© 2020 California Institute of Technology. Government sponsorship acknowledged.