What's New with Apache Tika?
Nick Burch @Gagravarr
CTO, Quanticate
Tika, in a nutshell
“small, yellow and leech-like, and
probably the oddest thing in the
Universe”
• Like a Babel Fish for content!
• Helps you work out what sort of thing
your content (1s & 0s) is
• Helps you extract the metadata from it,
in a consistent way
• Lets you get a plain text version of your
content, eg for full text indexing
• Provides a rich (XHTML) version too
A bit of history
Before Tika
• In the early 2000s, everyone was building a search engine /
search system for their CMS / web spider / etc
• Lucene mailing list and wiki had lots of code snippets for
using libraries to extract text
• Lots of bugs, people using old versions, people missing out
on useful formats, confusion abounded
• Handful of commercial libraries, generally expensive and
aimed at large companies and/or computer forensics
• Everyone was re-inventing the wheel, and doing it badly....
Tika's History (in brief)
• The idea for Tika first came from the Apache Nutch
project, who wanted to get useful things out of all the
content they were spidering and indexing
• The Apache Lucene project (which Nutch used) was also
interested, as lots of people there had the same problems
• Ideas and discussions started in 2006
• Project founded in 2007, in the Apache Incubator
• Initial contributions from Nutch, Lucene and Lius
• Graduated in 2008, v1.0 in 2011
Tika Releases
(Chart: releases over time, from the first 0.x versions in late 2007 through the 1.x series)
A (brief) introduction to Tika
(Some) Supported Formats
• HTML, XHTML, XML
• Microsoft Office – Word, Excel, PowerPoint, Works,
Publisher, Visio – Binary and OOXML formats
• OpenDocument (OpenOffice)
• iWorks – Keynote, Pages, Numbers
• PDF, RTF, Plain Text, CHM Help
• Compression / Archive – Zip, Tar, Ar, 7z, bz2, gz etc
• Atom, RSS, ePub
• Lots of scientific formats
• Audio – MP3, MP4, Vorbis, Opus, Speex, MIDI, Wav
• Image – JPEG, TIFF, PNG, BMP, GIF, ICO
Detection
• Work out what kind of file something is
• Based on a mixture of things
• Filename
• Mime magic (first few hundred bytes)
• Dedicated code (eg containers)
• Some combination of all of these
• Can be used as a standalone – what is this thing?
• Can be combined with parsers – figure out what this is,
then find a parser to work on it
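The detection approaches above are all wrapped up behind the `Tika` facade; a minimal sketch (the filename and the byte prefix here are just examples):

```java
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class DetectExample {
    public static void main(String[] args) {
        Tika tika = new Tika();

        // Filename-based detection, using the known extension globs
        String byName = tika.detect("report.pdf");

        // Mime-magic detection, using the first few bytes of content
        String byMagic = tika.detect("%PDF-1.4".getBytes(StandardCharsets.UTF_8));

        // Both report "application/pdf" here
        System.out.println(byName + " / " + byMagic);
    }
}
```

Given both a stream and a filename, Tika combines the two signals, generally preferring the content-based answer and using the name to refine it.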
Metadata
• Describes a file
• eg Title, Author, Creation Date, Location
• Tika provides a way to extract this (where present)
• However, each file format tends to have its own kind of
metadata, which can vary a lot
• eg Author, Creator, Created By, First Author, Creator[0]
• Tika tries to map file format specific metadata onto
common, consistent metadata keys
• “Give me the thing that closest represents what Dublin
Core defines as Creator”
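That mapping is exposed through property constants like `TikaCoreProperties.CREATOR`; a minimal sketch (the `example.docx` path is a placeholder):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();

        // "example.docx" is a placeholder; any supported format works
        try (InputStream stream = Files.newInputStream(Paths.get("example.docx"))) {
            parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
        }

        // Format-specific keys are mapped onto the consistent Dublin Core one
        System.out.println("Creator: " + metadata.get(TikaCoreProperties.CREATOR));
    }
}
```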
Plain Text
• Most file formats include at least some text
• For a plain text file, that's everything in it!
• For others, it's only part
• Lots of libraries out there which can extract text, but how
you call them varies a lot
• Tika wraps all that up for you, and gives consistency
• Plain Text is ideal for things like Full Text Indexing, eg to
feed into Solr, Lucene or Elasticsearch
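A minimal sketch of the one-call plain text route, via the `Tika` facade (the filename is a placeholder):

```java
import java.io.File;

import org.apache.tika.Tika;

public class TextExample {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();

        // Detects the type, picks a parser, and returns plain text in one call
        String text = tika.parseToString(new File("report.pdf"));

        System.out.println(text);
    }
}
```

The resulting string is ready to hand straight to a full text indexer as the document body.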
XHTML
• Structured Text extraction
• Outputs SAX events for the tags and text of a file
• This is actually the Tika default; Plain Text is implemented
by catching only the text parts of the SAX output
• Isn't supposed to be the “exact representation”
• Aims to give meaningful, semantic but simple output
• Can be used for basic previews
• Can be used to filter, eg ignore header + footer then give
remainder as plain text
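To capture the XHTML rather than just the text, swap in a different content handler; a sketch using `ToXMLContentHandler` to serialise the SAX events back into a string:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

public class XhtmlExample {
    public static void main(String[] args) throws Exception {
        // Serialises the tag + text SAX events back out as XHTML
        ToXMLContentHandler handler = new ToXMLContentHandler();

        byte[] doc = "Hello, Tika!".getBytes(StandardCharsets.UTF_8);
        new AutoDetectParser().parse(new ByteArrayInputStream(doc),
                handler, new Metadata(), new ParseContext());

        // Prints an XHTML document with the content inside <body>
        System.out.println(handler.toString());
    }
}
```

Plugging a custom `ContentHandler` in its place is how you would do the header/footer filtering mentioned above.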
What's New?
Formats and Parsers
Supported Formats
• HTML
• XML
• Microsoft Office
• Word
• PowerPoint
• Excel (2,3,4,5,97+)
• Visio
• Outlook
Supported Formats
• Open Document Format (ODF)
• iWorks
• PDF
• ePUB
• RTF
• Tar, RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200
• Plain Text
• RSS and Atom
Supported Formats
• IPTC ANPA Newswire
• CHM Help
• Wav, MIDI
• MP3, MP4 Audio
• Ogg Vorbis, Speex, FLAC, Opus
• PNG, JPG, BMP, TIFF, BPG
• FLV, MP4 Video
• Java classes
Supported Formats
• Source Code
• Mbox, RFC822, Outlook PST, Outlook MSG, TNEF
• DWG CAD
• DIF, GDAL, ISO-19139, Grib, HDF, ISA-Tab, NetCDF, Matlab
• Executables (Windows, Linux, Mac)
• Pkcs7
• SQLite
• Microsoft Access
OCR
OCR
• What if you don't have a text file, but instead a photo of
some text? Or a scan of some text?
• OCR (Optical Character Recognition) to the rescue!
• Tesseract is an Open Source OCR tool
• Tika has a parser which will call out to Tesseract when
suitable images are found
• Tesseract is found and used if on the path
• Explicit path can be given, or can be disabled
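A sketch of giving Tika an explicit Tesseract location via the `ParseContext` (the install path shown is hypothetical; by default the PATH is searched):

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;

public class OcrConfigExample {
    public static void main(String[] args) {
        TesseractOCRConfig config = new TesseractOCRConfig();

        // Hypothetical install location; without this, the PATH is searched
        config.setTesseractPath("/opt/tesseract/bin");

        // Parsers pick the config up from the ParseContext
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, config);
        // ...then pass this context into your parse(...) calls as usual
    }
}
```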
Container Formats
Databases
Databases
• A surprising number of Database and “database” systems
have a single-file mode
• If there's a single file, and a suitable library or program,
then Tika can get the data out!
• Main ones so far are MS Access & SQLite
• How best to represent the contents in XHTML?
• One HTML table per database table is the best we have, so far!
Tika Config XML
Tika Config XML
• Using Config, you can specify what Parsers, Detectors,
Translator, Service Loader and Mime Types to use
• You can do it explicitly
• You can do it implicitly (with defaults)
• You can do “default except”
• Tools available to dump out a running config as XML
• Use the Tika App to see what you have
Tika Config XML example
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser">
<mime-exclude>image/jpeg</mime-exclude>
<mime-exclude>application/pdf</mime-exclude>
<parser-exclude
class="org.apache.tika.parser.executable.ExecutableParser"/>
</parser>
<parser class="org.apache.tika.parser.EmptyParser">
<mime>application/pdf</mime>
</parser>
</parsers>
</properties>
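Loading a config like the one above, and building parsers from it, is one constructor call (the `tika-config.xml` filename is a placeholder):

```java
import java.io.File;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;

public class ConfigExample {
    public static void main(String[] args) throws Exception {
        // "tika-config.xml" is a placeholder path to the XML above
        TikaConfig config = new TikaConfig(new File("tika-config.xml"));

        // This parser now follows the XML excludes rather than the defaults
        Parser parser = new AutoDetectParser(config);
    }
}
```

With no argument, `TikaConfig.getDefaultConfig()` gives the implicit, everything-enabled defaults.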
Embedded Resources
Tika App
Tika Server
OSGi
Tika Batch
Apache cTAKES
Troubleshooting
Troubleshooting
• http://wiki.apache.org/tika/Troubleshooting%20Tika
What's Coming Soon?
Apache Tika 1.11
Tika 1.11
• Library upgrades for bug fixes (POI, PDFBox etc)
• Tika Config XML enhancements
• Tika Config XML output / dumping
• Apache Commons IO used more widely
• GROBID
• Hopefully due in a few weeks!
Apache Tika 1.12+
Tika 1.12+
• Commons IO in Core? TBD
• Java 7 Paths – where java.io.File used
• More NLP enhancement / augmentation
• Metadata aliasing
• Plus preparations for Tika 2
Tika 2.0
Why no Tika v2 yet?
• Apache Tika 0.1 – December 2007
• Apache Tika 1.0 – November 2011
• Shouldn't we have had a v2 by now?
• Discussions started several years ago, on the list
• Plans for what we need on the wiki for ~1 year
• Largely though, every time someone came up with a breaking
feature for 2.0, a compatible way to do it was found!
Deprecated Parts
• Various parts of Tika have been deprecated over the years
• All of those will go!
• Main ones that might bite you:
• Parser parse with no ParseContext
• Old style Metadata keys
Metadata Storage
• Currently, Metadata in Tika is String Key/Value Lists
• Many Metadata types have Properties, which provide
typing, conversions, sanity checks etc
• But all still stored as String Key + Value(s)
• Some people think we need a richer storage model
• Others want to keep it simple!
• JSON, XML DOM, XMP being debated
• Richer string keys also proposed
Java Packaging of Tika
• Maven Packages of Tika are
• Tika Core
• Tika Parsers
• Tika Bundle
• Tika XMP
• Tika Java 7
• For just some parsers, you need to exclude Maven
dependencies
• Should we have “Tika Parser PDF”, “Tika Parsers ODF” etc?
Fallback/Preference Parsers
• If we have several parsers that can handle a format
• Preferences?
• If one fails, how about trying others?
Multiple Parsers
• If we have several parsers that can handle a format
• What about running all of them?
• eg extract image metadata
• then OCR it
• then try a second parser for more metadata
Parser Discovery/Loading?
• Currently, Tika uses a Service Loader mechanism to find
and load available Parsers (and Detectors+Translators)
• This allows you to drop a new Tika parser jar onto the
classpath, and have it automatically used
• Also allows you to miss one or two jars out, and not get any
content back with no warnings / errors...
• You can set the Service Loader to Warn, or even Error
• But most people don't, and it bites them!
• Change the default in 2? Or change entirely how we do it?
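A sketch of the stricter loading the slide suggests: construct the service loader with a non-silent `LoadErrorHandler`, so parsers that fail to load are reported rather than quietly skipped:

```java
import org.apache.tika.config.LoadErrorHandler;
import org.apache.tika.config.ServiceLoader;
import org.apache.tika.mime.MediaTypeRegistry;
import org.apache.tika.parser.DefaultParser;

public class LoaderExample {
    public static void main(String[] args) {
        // WARN logs parsers that fail to load; THROW fails fast instead
        ServiceLoader loader = new ServiceLoader(
                LoaderExample.class.getClassLoader(), LoadErrorHandler.WARN);

        DefaultParser parser = new DefaultParser(
                MediaTypeRegistry.getDefaultRegistry(), loader);

        System.out.println(parser.getAllComponentParsers().size()
                + " parsers found on the classpath");
    }
}
```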
What we still need help with...
Content Handler Reset/Add
• Tika uses the SAX Content Handler interface for supplying
plain text
• Streaming, write once
• How does that work with multiple parsers?
Content Enhancement
• How can we post-process the content to “enhance” it in
various ways?
• For example, how can we mark up parts of speech?
• Pull out information into the Metadata?
• Translate it, retaining the original positions?
• For just some formats, or for all?
• For just some documents in some formats?
• While still keeping the Streaming SAX-like contract?
Metadata Standards
• Currently, Tika works hard to map file-format-specific
metadata onto general metadata standards
• Means you don't have to know each standard in depth, can
just say “give me the closest to dc:subject you have, no
matter what file format or library it comes from”
• What about non-File-format metadata, such as content
metadata (Table of Contents, Author information etc)?
• What about combining things?
Richer Metadata
• See Metadata Storage slides!
Bonus! Apache Tika at Scale
Lots of Data is Junk
• At scale, you're going to hit lots of edge cases
• At scale, you're going to come across lots of junk or
corrupted documents
• 1% of a lot is still a lot...
• 1% of the internet is a huge amount!
• Bound to find files which are unusual or corrupted enough
to be mis-identified
• You need to plan for failures!
Unusual Types
• If you're working on a big data scale, you're bound to come
across lots of valid but unusual + unknown files
• You're never going to be able to add support for all of
them!
• May be worth adding support for the more common
“uncommon” unsupported types
• Which means you'll need to track something about the files
you couldn't understand
• If Tika knows the mimetype but has no parser, just log the
mimetype
• If mimetype unknown, maybe log first few bytes
Failure at Scale
• Tika will sometimes mis-identify something, so sometimes
the wrong parser will run and object
• Some files will cause parsers or their underlying libraries to
do something silly, such as use lots of memory or get
into loops with lots to do
• Some files will cause parsers or their underlying libraries to
OOM, or infinite loop, or something else bad
• If a file fails once, it will probably fail again, so blindly
re-running that task won't help
Failure at Scale, continued
• You'll need approaches that plan for failure
• Consider what will happen if a file locks up your JVM, or
kills it with an OOM
• Forked Parser may be worth using
• Running a separate Tika Server could be good
• Depending on work needed, could have a smaller pool of
Tika Server instances for big data code to call
• Think about failure modes, then think about retries (or not)
• Track common problems, report and fix them!
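One concrete isolation option from the slides is the `ForkParser`, which runs parsing in child JVMs so a hang or OOM kills the child process rather than your application; a minimal sketch:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkExample {
    public static void main(String[] args) throws Exception {
        // Parsing happens in forked JVMs: a crash there does not
        // take down the application that called parse()
        ForkParser parser = new ForkParser(
                ForkExample.class.getClassLoader(), new AutoDetectParser());
        parser.setPoolSize(4);  // at most 4 forked JVMs at once
        try {
            BodyContentHandler handler = new BodyContentHandler();
            byte[] doc = "a possibly risky document".getBytes(StandardCharsets.UTF_8);
            parser.parse(new ByteArrayInputStream(doc), handler,
                         new Metadata(), new ParseContext());
            System.out.println(handler.toString());
        } finally {
            parser.close();
        }
    }
}
```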
Bonus! Tika Batch, Eval & Hadoop
Tika Batch – TIKA-1330
• Aiming to provide a robust Tika wrapper, that handles
OOMs, permanent hangs, out of file handles etc
• Should be able to use Tika Batch to run Tika against a wide
range of documents, getting either content or an error
• First focus was on the Tika App, with a disk-to-disk wrapper
• Now looking at the Tika Server, to have it log errors,
provide a watchdog to restart after serious errors etc
• Once that's all baked in, refactor and move it fully to Hadoop!
• Accept there will always be errors! Work with that
Tika Batch Hadoop
• Now we have the basic Tika Batch working – Hadoop it!
• Aiming to provide a full Hadoop Tika Batch implementation
• Will process a large collection of files, providing either
Metadata+Content, or a detailed error of failure
• Failure could be machine/environment related, so probably
need to retry a failure in case it isn't a Tika issue!
• Will be partly inspired by the work Apache Nutch does
• Tika will “eat its own dogfood” with this, using it to test for
regressions / improvements between versions
Tika Eval – TIKA-1332
• Building on top of Tika Batch, to work out how well / badly a
version of Tika does on a large collection of documents
• Provide comparable profiling of a run on a corpus
• Number of different file types found, number of exceptions,
exceptions by type and file type, attachments etc
• Also provide information on language stats, and junk text
• Identify file types to look at supporting
• Identify file types / exceptions which have regressed
• Identify exceptions / problems to try to fix
• Identify things for manual review, eg TIKA-1442 PDFBox bug
Batch+Eval+Public Datasets
• When looking at a new feature, or looking to upgrade a
dependency, we want to know if we have broken anything
• Unit tests provide a good first-pass, but only so many files
• Running against a very large dataset and comparing
before/after is the best way to handle it
• Initially piloting + developing against the Govdocs1 corpus
http://digitalcorpora.org/corpora/govdocs
• Using donated hosting from Rackspace for trying this
• Need newer + more varied corpora as well! Know of any?
Any Questions?
Nick Burch
@Gagravarr
nick@apache.org
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzz...
 
The Apache Way
The Apache WayThe Apache Way
The Apache Way
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?
 
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
 
The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!
 

Último

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 

Último (20)

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

What's new with Apache Tika?

  • 1. What's new with Apache Tika?
  • 2. What's New with Apache Tika? Nick Burch @Gagravarr CTO, Quanticate
  • 3. Tika, in a nutshell “small, yellow and leech-like, and probably the oddest thing in the Universe” • Like a Babel Fish for content! • Helps you work out what sort of thing your content (1s & 0s) is • Helps you extract the metadata from it, in a consistent way • Lets you get a plain text version of your content, eg for full text indexing • Provides a rich (XHTML) version too
  • 4. A bit of history
  • 5. Before Tika • In the early 2000s, everyone was building a search engine / search system for their CMS / web spider / etc • Lucene mailing list and wiki had lots of code snippets for using libraries to extract text • Lots of bugs, people using old versions, people missing out on useful formats, confusion abounded • Handful of commercial libraries, generally expensive and aimed at large companies and/or computer forensics • Everyone was re-inventing the wheel, and doing it badly....
  • 6. Tika's History (in brief) • The idea for Tika first came from the Apache Nutch project, who wanted to get useful things out of all the content they were spidering and indexing • The Apache Lucene project (which Nutch used) was also interested, as lots of people there had the same problems • Ideas and discussions started in 2006 • Project founded in 2007, in the Apache Incubator • Initial contributions from Nutch, Lucene and Lius • Graduated in 2008, v1.0 in 2011
  • 7. Tika Releases (chart: number of releases per period, 27/12/2007 – 27/12/2013)
  • 8. A (brief) introduction to Tika
  • 9. (Some) Supported Formats • HTML, XHTML, XML • Microsoft Office – Word, Excel, PowerPoint, Works, Publisher, Visio – Binary and OOXML formats • OpenDocument (OpenOffice) • iWorks – Keynote, Pages, Numbers • PDF, RTF, Plain Text, CHM Help • Compression / Archive – Zip, Tar, Ar, 7z, bz2, gz etc • Atom, RSS, ePub • Lots of scientific formats • Audio – MP3, MP4, Vorbis, Opus, Speex, MIDI, Wav • Image – JPEG, TIFF, PNG, BMP, GIF, ICO
  • 10. Detection • Work out what kind of file something is • Based on a mixture of things • Filename • Mime magic (first few hundred bytes) • Dedicated code (eg containers) • Some combination of all of these • Can be used as a standalone – what is this thing? • Can be combined with parsers – figure out what this is, then find a parser to work on it
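The mime-magic part of detection can be sketched in plain Java as a lookup of a file's leading bytes against known signatures. This is a toy illustration of the idea, not Tika's actual detector (in Tika itself you would simply call the facade's detect method); the signatures shown are the well-known PDF, PNG and Zip magic numbers.

```java
import java.util.Arrays;

// Toy mime-magic sniffer: checks a file's first few bytes against known
// signatures - the same idea Tika's magic-based detection builds on.
public class MagicSniffer {
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP = {'P', 'K', 3, 4}; // also OOXML/ODF containers!

    public static String detect(byte[] header) {
        if (startsWith(header, PDF)) return "application/pdf";
        if (startsWith(header, PNG)) return "image/png";
        if (startsWith(header, ZIP)) return "application/zip";
        return "application/octet-stream"; // unknown fallback
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
            && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {'%', 'P', 'D', 'F', '-', '1'}));
    }
}
```

Note how the Zip signature illustrates why magic alone isn't enough: an OOXML document and an OpenDocument file both start with the same bytes, which is exactly why the slide mentions dedicated container detection code on top of magic and filenames.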
  • 11. Metadata • Describes a file • eg Title, Author, Creation Date, Location • Tika provides a way to extract this (where present) • However, each file format tends to have its own kind of metadata, which can vary a lot • eg Author, Creator, Created By, First Author, Creator[0] • Tika tries to map file format specific metadata onto common, consistent metadata keys • “Give me the thing that closest represents what Dublin Core defines as Creator”
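The key-mapping idea can be sketched as a tiny normaliser that answers "give me the closest thing to a Dublin Core creator" regardless of which native key a format used. The key names are the examples from the slide; this is an illustration of the mapping concept, not Tika's Metadata/Property implementation.

```java
import java.util.Map;

// Toy metadata mapper: normalises format-specific author keys onto one
// consistent answer, the way Tika maps native metadata onto common keys.
public class MetadataMapper {
    // Example native key variants, as listed on the slide
    private static final String[] AUTHOR_KEYS =
        {"Author", "Creator", "Created By", "First Author"};

    public static String creator(Map<String, String> raw) {
        for (String key : AUTHOR_KEYS) {
            String value = raw.get(key);
            if (value != null) return value; // first present variant wins
        }
        return null; // no author metadata in this file
    }

    public static void main(String[] args) {
        System.out.println(creator(Map.of("Created By", "Nick Burch")));
    }
}
```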
  • 12. Plain Text • Most file formats include at least some text • For a plain text file, that's everything in it! • For others, it's only part • Lots of libraries out there which can extract text, but how you call them varies a lot • Tika wraps all that up for you, and gives consistency • Plain Text is ideal for things like Full Text Indexing, eg to feed into SOLR, Lucene or ElasticSearch
  • 13. XHTML • Structured Text extraction • Outputs SAX events for the tags and text of a file • This is actually the Tika default, Plain Text is implemented by only catching the Text parts of the SAX output • Isn't supposed to be the “exact representation” • Aims to give meaningful, semantic but simple output • Can be used for basic previews • Can be used to filter, eg ignore header + footer then give remainder as plain text
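That "plain text is just the text parts of the SAX output" point can be demonstrated with the JDK's own SAX parser: a handler that ignores every tag and keeps only the character events. This is a sketch of the design, using only the standard library, not Tika's actual handler classes.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of how plain text falls out of the XHTML stream: a SAX handler
// that drops the markup events and keeps only the character events.
public class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // keep text, ignore tags
    }

    public String getText() { return text.toString(); }

    public static String extract(String xhtml) {
        try {
            TextOnlyHandler handler = new TextOnlyHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><p>Hello</p> <p>Tika</p></body></html>"));
    }
}
```

The same event stream could instead be serialised back out as XHTML, which is why Tika makes the structured form the default and derives plain text from it, rather than the other way around.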
  • 16. Supported Formats • HTML • XML • Microsoft Office • Word • PowerPoint • Excel (2,3,4,5,97+) • Visio • Outlook
  • 17. Supported Formats • Open Document Format (ODF) • iWorks • PDF • ePUB • RTF • Tar, RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200 • Plain Text • RSS and Atom
  • 18. Supported Formats • IPTC ANPA Newswire • CHM Help • Wav, MIDI • MP3, MP4 Audio • Ogg Vorbis, Speex, FLAC, Opus • PNG, JPG, BMP, TIFF, BPG • FLV, MP4 Video • Java classes
  • 19. Supported Formats • Source Code • Mbox, RFC822, Outlook PST, Outlook MSG, TNEF • DWG CAD • DIF, GDAL, ISO-19139, Grib, HDF, ISA-Tab, NetCDF, Matlab • Executables (Windows, Linux, Mac) • Pkcs7 • SQLite • Microsoft Access
  • 21. OCR • What if you don't have a text file, but instead a photo of some text? Or a scan of some text? • OCR (Optical Character Recognition) to the rescue! • Tesseract is an Open Source OCR tool • Tika has a parser which'll call out to Tesseract for suitable images found • Tesseract is found and used if on the path • Explicit path can be given, or can be disabled
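The "found and used if on the path" behaviour boils down to scanning the PATH entries for a named executable before enabling the parser. Here is a minimal sketch of that lookup in plain Java; the binary name "tesseract" is just the example from the slide, and this is not Tika's actual configuration code.

```java
import java.io.File;

// Sketch of an on-PATH executable lookup, roughly what a parser that
// shells out to an external tool (like Tesseract) needs to do first.
public class PathLookup {
    public static File findOnPath(String name) {
        String path = System.getenv("PATH");
        if (path == null) return null;
        for (String dir : path.split(File.pathSeparator)) {
            File candidate = new File(dir, name);
            if (candidate.isFile() && candidate.canExecute()) return candidate;
        }
        return null; // not installed - OCR would simply be skipped
    }

    public static void main(String[] args) {
        File tesseract = findOnPath("tesseract");
        System.out.println(tesseract != null
            ? "OCR available at " + tesseract
            : "OCR disabled - tesseract not found");
    }
}
```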
  • 24. Databases • A surprising number of Database and “database” systems have a single-file mode • If there's a single file, and a suitable library or program, then Tika can get the data out! • Main ones so far are MS Access & SQLite • How best to represent the contents in XHTML? • One HTML table per Database Table best we have, so far!
  • 25. Tika Config XML
  • 26. Tika Config XML • Using Config, you can specify what Parsers, Detectors, Translator, Service Loader and Mime Types to use • You can do it explicitly • You can do it implicitly (with defaults) • You can do “default except” • Tools available to dump out a running config as XML • Use the Tika App to see what you have
  • 27. Tika Config XML example

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
      <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
          <mime-exclude>image/jpeg</mime-exclude>
          <mime-exclude>application/pdf</mime-exclude>
          <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
        </parser>
        <parser class="org.apache.tika.parser.EmptyParser">
          <mime>application/pdf</mime>
        </parser>
      </parsers>
    </properties>
  • 38. Tika 1.11 • Library upgrades for bug fixes (POI, PDFBox etc) • Tika Config XML enhancements • Tika Config XML output / dumping • Apache Commons IO used more widely • GROBID • Hopefully due in a few weeks!
  • 40. Tika 1.12+ • Commons IO in Core? TBD • Java 7 Paths – where java.io.File used • More NLP enhancement / augmentation • Metadata aliasing • Plus preparations for Tika 2
  • 42. Why no Tika v2 yet? • Apache Tika 0.1 – December 2007 • Apache Tika 1.0 – November 2011 • Shouldn't we have had a v2 by now? • Discussions started several years ago, on the list • Plans for what we need on the wiki for ~1 year • Largely though, every time someone came up with a breaking feature for 2.0, a compatible way to do it was found!
  • 43. Deprecated Parts • Various parts of Tika have been deprecated over the years • All of those will go! • Main ones that might bite you: • Parser parse with no ParseContext • Old style Metadata keys
  • 44. Metadata Storage • Currently, Metadata in Tika is String Key/Value Lists • Many Metadata types have Properties, which provide typing, conversions, sanity checks etc • But all still stored as String Key + Value(s) • Some people think we need a richer storage model • Others want to keep it simple! • JSON, XML DOM, XMP being debated • Richer string keys also proposed
  • 45. Java Packaging of Tika • Maven Packages of Tika are • Tika Core • Tika Parsers • Tika Bundle • Tika XMP • Tika Java 7 • For just some parsers, you need to exclude maven dependencies • Should we have “Tika Parser PDF”, “Tika Parsers ODF” etc?
  • 46. Fallback/Preference Parsers • If we have several parsers that can handle a format • Preferences? • If one fails, how about trying others?
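The fallback question has a familiar shape: try each candidate in preference order and take the first success. A minimal sketch of that chain, with a tiny stand-in interface rather than Tika's real Parser API:

```java
// Sketch of fallback parsing: try each candidate parser in preference
// order, falling through on failure. MiniParser is a stand-in interface,
// not Tika's real Parser.
public class FallbackChain {
    public interface MiniParser {
        String parse(byte[] data) throws Exception;
    }

    public static String parseWithFallback(byte[] data, MiniParser... parsers) {
        for (MiniParser parser : parsers) {
            try {
                return parser.parse(data);  // first success wins
            } catch (Exception e) {
                // this parser failed - fall through to the next candidate
            }
        }
        return null;                        // every parser failed
    }

    public static void main(String[] args) {
        MiniParser broken = d -> { throw new Exception("boom"); };
        MiniParser working = d -> new String(d);
        System.out.println(parseWithFallback("hi".getBytes(), broken, working));
    }
}
```

The open design questions from the slide sit on top of this: how the preference order gets decided, and whether a "failure" worth falling back on is an exception, empty output, or something subtler.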
  • 47. Multiple Parsers • If we have several parsers that can handle a format • What about running all of them? • eg extract image metadata • then OCR it • then try a second parser for more metadata
  • 48. Parser Discovery/Loading? • Currently, Tika uses a Service Loader mechanism to find and load available Parsers (and Detectors+Translators) • This allows you to drop a new Tika parser jar onto the classpath, and have it automatically used • Also allows you to miss one or two jars out, and not get any content back with no warnings / errors... • You can set the Service Loader to Warn, or even Error • But most people don't, and it bites them! • Change the default in 2? Or change entirely how we do it?
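The silent-miss problem comes straight from how java.util.ServiceLoader works: providers are listed in META-INF/services inside each jar, so a missing jar just means fewer providers, with no error. A small demonstration with a hypothetical stand-in interface (no provider jar registered, so the loader finds nothing):

```java
import java.util.ServiceLoader;

// Demonstrates the ServiceLoader silent-miss: with no provider registered
// on the classpath, loading simply yields zero implementations - no
// warning, no error. ExampleParser is a stand-in, not a Tika interface.
public class LoaderDemo {
    public interface ExampleParser { }

    public static int countProviders() {
        int count = 0;
        for (ExampleParser p : ServiceLoader.load(ExampleParser.class)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Nothing registered under META-INF/services for ExampleParser,
        // so this quietly reports zero - the failure mode the slide warns about
        System.out.println("Parsers found: " + countProviders());
    }
}
```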
  • 49. What we still need help with...
  • 50. Content Handler Reset/Add • Tika uses the SAX Content Handler interface for supplying plain text • Streaming, write once • How does that work with multiple parsers?
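One answer to "streaming, write once, multiple consumers" is a tee: forward each event to several downstream handlers so every consumer sees the stream exactly once. A sketch of the shape, with a minimal interface standing in for SAX's ContentHandler:

```java
// Sketch of teeing a write-once event stream to multiple consumers,
// similar in spirit to a tee content handler. CharSink is a minimal
// stand-in for SAX's ContentHandler.
public class TeeSink {
    public interface CharSink {
        void characters(char[] ch, int start, int length);
    }

    public static class Collector implements CharSink {
        public final StringBuilder text = new StringBuilder();
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    private final CharSink[] downstream;
    public TeeSink(CharSink... downstream) { this.downstream = downstream; }

    public void characters(char[] ch, int start, int length) {
        for (CharSink sink : downstream) {
            sink.characters(ch, start, length); // every consumer sees each event
        }
    }

    public static void main(String[] args) {
        Collector plainText = new Collector();
        Collector languageDetect = new Collector();
        TeeSink tee = new TeeSink(plainText, languageDetect);
        char[] chars = "streamed once".toCharArray();
        tee.characters(chars, 0, chars.length);
        System.out.println(plainText.text + " / " + languageDetect.text);
    }
}
```

Teeing works when the consumers are independent; it doesn't by itself answer the harder reset/re-run cases the slide raises, where a second parser wants to start over on a stream that has already been consumed.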
  • 51. Content Enhancement • How can we post-process the content to “enhance” it in various ways? • For example, how can we mark up parts of speech? • Pull out information into the Metadata? • Translate it, retaining the original positions? • For just some formats, or for all? • For just some documents in some formats? • While still keeping the Streaming SAX-like contract?
  • 52. Metadata Standards • Currently, Tika works hard to map file-format-specific metadata onto general metadata standards • Means you don't have to know each standard in depth, can just say “give me the closest to dc:subject you have, no matter what file format or library it comes from” • What about non-File-format metadata, such as content metadata (Table of Contents, Author information etc)? • What about combining things?
  • 53. Richer Metadata • See Metadata Storage slides!
  • 54. Bonus! Apache Tika at Scale
  • 55. Lots of Data is Junk • At scale, you're going to hit lots of edge cases • At scale, you're going to come across lots of junk or corrupted documents • 1% of a lot is still a lot... • 1% of the internet is a huge amount! • Bound to find files which are unusual or corrupted enough to be mis-identified • You need to plan for failures!
  • 56. Unusual Types • If you're working on a big data scale, you're bound to come across lots of valid but unusual + unknown files • You're never going to be able to add support for all of them! • May be worth adding support for the more common “uncommon” unsupported types • Which means you'll need to track something about the files you couldn't understand • If Tika knows the mimetype but has no parser, just log the mimetype • If mimetype unknown, maybe log first few bytes
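Logging "the first few bytes" of an unidentifiable file is easy to sketch: hex-encode a short prefix so recurring unknown signatures can be spotted later in the logs. A minimal helper (the 0xCAFEBABE bytes in main are just an example header):

```java
// Sketch of logging the leading bytes of an unidentifiable file as hex,
// so unknown-but-common types can be spotted and supported later.
public class HeaderLogger {
    public static String hexPrefix(byte[] data, int max) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < Math.min(max, data.length); i++) {
            if (i > 0) out.append(' ');
            out.append(String.format("%02x", data[i] & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] mystery = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0};
        System.out.println("Unknown type, header: " + hexPrefix(mystery, 4));
    }
}
```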
  • 57. Failure at Scale • Tika will sometimes mis-identify something, so sometimes the wrong parser will run and object • Some files will cause parsers or their underlying libraries to do something silly, such as use lots of memory or get into loops with lots to do • Some files will cause parsers or their underlying libraries to OOM, or infinite loop, or something else bad • If a file fails once, it will probably fail again, so blindly re-running that task won't help
  • 58. Failure at Scale, continued • You'll need approaches that plan for failure • Consider what will happen if a file locks up your JVM, or kills it with an OOM • Forked Parser may be worth using • Running a separate Tika Server could be good • Depending on work needed, could have a smaller pool of Tika Server instances for big data code to call • Think about failure modes, then think about retries (or not) • Track common problems, report and fix them!
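One simple "plan for failure" pattern is a watchdog: run each parse on a separate thread and give up after a timeout, so one looping document can't hang the whole run. A sketch using only the JDK's ExecutorService (a forked parser in a separate process goes further, since it also survives OOMs and crashes):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a parse watchdog: run the parse with a deadline, record a
// failure instead of hanging when a document gets stuck.
public class ParseWatchdog {
    public static String parseWithTimeout(Callable<String> parseTask,
                                          long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupt the stuck parse
            return null;           // record this document as failed
        } catch (Exception e) {
            return null;           // parser threw - also a failure to record
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String ok = parseWithTimeout(() -> "some text", 1000);
        String stuck = parseWithTimeout(() -> { Thread.sleep(5000); return "never"; }, 100);
        System.out.println(ok + " / " + (stuck == null ? "timed out" : stuck));
    }
}
```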
  • 59. Bonus! Tika Batch, Eval & Hadoop
  • 60. Tika Batch – TIKA-1330 • Aiming to provide a robust Tika wrapper, that handles OOMs, permanent hangs, out of file handles etc • Should be able to use Tika Batch to run Tika against a wide range of documents, getting either content or an error • First focus was on the Tika App, with a disk-to-disk wrapper • Now looking at the Tika Server, to have it log errors, provide a watchdog to restart after serious errors etc • Once that's all baked in, refactor and fully Hadoop it! • Accept there will always be errors! Work with that
  • 61. Tika Batch Hadoop • Now we have the basic Tika Batch working – Hadoop it! • Aiming to provide a full Hadoop Tika Batch implementation • Will process a large collection of files, providing either Metadata+Content, or a detailed error of failure • Failure could be machine/environment, so probably need to retry a failure in case it isn't a Tika issue! • Will be partly inspired by the work Apache Nutch does • Tika will “eat our own dogfood” with this, using it to test for regressions / improvements between versions
  • 62. Tika Eval – TIKA-1332 • Building on top of Tika Batch, to work out how well / badly a version of Tika does on a large collection of documents • Provide comparable profiling of a run on a corpus • Number of different file types found, number of exceptions, exceptions by type and file type, attachments etc • Also provide information on language stats, and junk text • Identify file types to look at supporting • Identify file types / exceptions which have regressed • Identify exceptions / problems to try to fix • Identify things for manual review, eg TIKA-1442 PDFBox bug
  • 63. Batch+Eval+Public Datasets • When looking at a new feature, or looking to upgrade a dependency, we want to know if we have broken anything • Unit tests provide a good first-pass, but only cover so many files • Running against a very large dataset and comparing before/after is the best way to handle it • Initially piloting + developing against the Govdocs1 corpus http://digitalcorpora.org/corpora/govdocs • Using donated hosting from Rackspace for trying this • Need newer + more varied corpora as well! Know of any?