[ Rencontres Mondiales du Logiciel Libre 2010 - Thursday, July 8 ]

Introduction to libre « fulltext » technology
Author: Ir Robert Viseur
Who am I?

• My name : Robert Viseur
  • Civil Engineer, Master in Management of Innovation.
  • Specific expertise in the economics of free software and
    practices of co-creation.
  • Manager of logiciellibre.com (directory of free software
    companies).
  • Assistant in the Department of Economics and Management
    of Innovation (mi.fpms.ac.be) of the Faculty of Engineering,
    University of Mons (www.umons.ac.be).
  • Technology advisor at CETIC (www.cetic.be).
     • Belgian ICT Research and Technology Transfer Centre.
     • Initiator of Cellavi project (Center of Expertise for Open Source use
       in Industrial Applications).
Introduction
What will we talk about?

• Limits of conventional DBMS for text searching.
• Technologies for fulltext in DBMS (MySQL fulltext,
  PostgreSQL tsearch, Sphinx Search for MySQL).
• Free indexers (Lucene family, Xapian).
• We will not talk about NoSQL databases
  (Cassandra, CouchDB, HBase, ...).
Why is this useful?

• Searching for articles in a CMS (Content Management
  System), for posts in a forum or for items in an
  online shop.
• Searching news, podcasts or RSS / Atom feeds.
• Searching the content of books or PDF papers.
• ...
Four steps
What are the important steps?

• Four important steps:
  •   extraction,
  •   indexing,
  •   search,
  •   presentation of results.
Step 1 : the extraction (1/2)

• Convert the file to be indexed into plain text.
• Simple cases:
   • structured text files
      • Examples:
         – XML (with PHP::SimpleXML),
         – RSS (with PHP::SimplePie),
         – HTML (with PHP::strip_tags or an HTML parser).
   • documented complex formats
      • Example: ODF = ZIP-compressed XML files.
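The ODF case can be sketched in a few lines. This is only an illustration in Python (the slides use PHP), assuming the text lives in the `content.xml` member of the ZIP container; the helper names `odf_to_text` and `make_sample_odf` are hypothetical, and the sample archive is a minimal stand-in built in memory, not a complete ODF document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def odf_to_text(odf_file):
    """Return the plain text found in content.xml of an ODF container."""
    with zipfile.ZipFile(odf_file) as archive:
        xml_bytes = archive.read("content.xml")
    root = ET.fromstring(xml_bytes)
    # itertext() yields every text node in document order; tags are dropped.
    chunks = (chunk.strip() for chunk in root.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


def make_sample_odf():
    """Build a tiny ODF-like archive in memory (illustration only)."""
    content = (
        '<office:document-content '
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
        "<office:body><office:text>"
        "<text:p>Free software</text:p><text:p>fulltext search</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(odf_to_text(make_sample_odf()))
```

The extracted string is what would then be handed to the indexing step; styles, metadata and embedded objects are ignored in this sketch.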
Step 1 : the extraction (2/2)

• Complex cases: undocumented binary formats.
  • Example: Microsoft Office formats (97, 2000, XP, ...).
  • Use of Open Source projects:
     • Apache POI (MS Office), Apache Tika (various
       documents), xls2csv (Microsoft Excel), catdoc (Microsoft
       Word), pdfinfo (PDF), ...
     • Extraction is often imperfect (~20% errors with POI).
  • Use of IFilters (MS Windows):
     • extensions provided by the publishers themselves to extract
       the contents of files (Microsoft Office, AutoCAD, etc.).
• Scanned documents: Open Source OCR solutions.
  • See Tesseract, OCRAD and GOCR (still emerging).
Steps 2 & 3: indexing and search

• What everybody knows: SELECT ... WHERE ...
  LIKE ...
• What is less well known: regular expressions.
• What you may be tempted to do: « do it
  yourself ».
• What is recommended instead: use
  standard technologies.
Step 4: presentation

• Export results to XML (API: OpenSearch standard
  format).
• XSL transformations to RSS, Atom, ...
• Spellchecking (e.g. with Aspell or with algorithms
  like Soundex or Levenshtein distance).
• Classification of results (e.g. with Reverend).
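As an illustration of the spellchecking idea, the Levenshtein distance can be used to propose a "did you mean" term taken from the index dictionary. The sketch below is in Python and uses the classic dynamic-programming formulation; the function names, the sample dictionary and the `max_distance` threshold are assumptions made for the example.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def suggest(query, dictionary, max_distance=2):
    """Return dictionary terms within max_distance of the query, closest first."""
    scored = [(levenshtein(query, term), term) for term in dictionary]
    return [term for distance, term in sorted(scored) if distance <= max_distance]


if __name__ == "__main__":
    print(suggest("linx", ["linux", "lucene", "xapian"]))
```

Comparing the query against every dictionary term is acceptable for a small vocabulary only; a dedicated tool such as Aspell is the more realistic choice for large ones.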
« Do it yourself » approaches
What happens if there is no fulltext solution? (1/2)
• This is notably the case with SQLite.
• The best-known approach: LIKE.
• Example: SELECT news.title, news.url FROM news
  WHERE news.title LIKE '%linux%'
• Possible improvements: splitting the query
  into tokens, regular expression filtering, ...
• Disadvantages: poor relevance, not suitable
  for large volumes of data.
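The LIKE approach, with the query split into tokens, can be sketched as follows. The example is in Python with the standard sqlite3 module rather than PHP; the `news` table mirrors the one in the slide, while the sample rows, the example.org URLs and the helper names are assumptions.

```python
import sqlite3


def like_pattern(token):
    """Wrap a token in % wildcards, escaping the LIKE metacharacters."""
    escaped = token.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return "%" + escaped + "%"


def search_titles(conn, query):
    """Return (title, url) rows whose title contains every token of the query."""
    tokens = query.split()
    if not tokens:
        return []
    # One LIKE clause per token, all combined with AND.
    where = " AND ".join("title LIKE ? ESCAPE '\\'" for _ in tokens)
    sql = "SELECT title, url FROM news WHERE " + where + " ORDER BY id"
    return conn.execute(sql, [like_pattern(t) for t in tokens]).fetchall()


def make_demo_db():
    """Create an in-memory database with a few hypothetical news items."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT, url TEXT)")
    conn.executemany(
        "INSERT INTO news (title, url) VALUES (?, ?)",
        [
            ("Linux kernel released", "http://example.org/1"),
            ("New linux distribution for servers", "http://example.org/2"),
            ("Windows update", "http://example.org/3"),
        ],
    )
    return conn


if __name__ == "__main__":
    for title, url in search_titles(make_demo_db(), "linux servers"):
        print(title, url)
```

Every query scans the whole table and the results carry no relevance score, which is exactly the limitation noted in the slide.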
What happens if there is no fulltext solution? (2/2)
• User-defined functions or regular expressions in SQL:
   • Example (PHP, SQLite):
      • sqlite_create_function($db, 'sqlite_fulltext', 'sqlite_fulltext', 2);
      • $sql = "SELECT * FROM torrents WHERE sqlite_fulltext(Search, '" .
        sqlite_escape_string($q) . "') == 1 ORDER BY Title";
      • In "sqlite_fulltext", use: preg_match("/\b($word)\b/i", $text).
   • Disadvantage: not suitable for large volumes of data.
• Regular expressions are also supported by MySQL, PostgreSQL
  and Firebird (since 2.5).
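The same idea can be sketched in Python, where the sqlite3 module exposes `create_function` to register an application-defined SQL function. The `torrents` table and its `Title` and `Search` columns follow the PHP example above; the function name `word_match` and the sample rows are assumptions.

```python
import re
import sqlite3


def word_match(word, text):
    """Return 1 if word occurs in text as a whole word (case-insensitive)."""
    if text is None:
        return 0
    pattern = r"\b" + re.escape(word) + r"\b"
    return 1 if re.search(pattern, text, re.IGNORECASE) else 0


def make_demo_db():
    """Create an in-memory database and register word_match as an SQL function."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("word_match", 2, word_match)
    conn.execute("CREATE TABLE torrents (Title TEXT, Search TEXT)")
    conn.executemany(
        "INSERT INTO torrents (Title, Search) VALUES (?, ?)",
        [
            ("Ubuntu ISO", "ubuntu linux iso image"),
            ("Linuxoid fonts", "linuxoid font pack"),
            ("Debian netinst", "Debian GNU/Linux installer"),
        ],
    )
    return conn


def search(conn, word):
    """Titles of the rows whose Search column contains the whole word."""
    sql = "SELECT Title FROM torrents WHERE word_match(?, Search) = 1 ORDER BY Title"
    return [title for (title,) in conn.execute(sql, (word,))]


if __name__ == "__main__":
    print(search(make_demo_db(), "linux"))
```

Unlike `LIKE '%linux%'`, the word boundaries keep "linuxoid" out of the results, but the function is still evaluated on every row, so the approach does not scale.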
Is it possible to do it yourself? (1/2)
• Yes, but the development is difficult.
   • Create the dictionary:
      • Filter the text by removing non-alphanumeric characters.
      • Split the filtered text into terms (tokens).
      • Remove stop words (the, a, an, my, your, his, our,
        their, ...).
      • Build a correspondence table (document identifier,
        term).
         – Each term is associated with the list of documents containing
           that term.
   • Write the right SQL queries...
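The dictionary-building steps above can be sketched in memory with plain Python structures (a real implementation would store the correspondence table in the database and query it in SQL). The stop-word list, the sample documents and the function names are assumptions, and the filter only keeps unaccented ASCII letters and digits.

```python
import re
from collections import defaultdict

# Deliberately short, English-only stop-word list (assumption for the example).
STOP_WORDS = {"the", "a", "an", "my", "your", "his", "our", "their", "to", "and"}


def tokenize(text):
    """Lowercase, drop non-alphanumeric characters, split, remove stop words."""
    cleaned = re.sub(r"[^0-9a-z]+", " ", text.lower())
    return [term for term in cleaned.split() if term not in STOP_WORDS]


def build_index(documents):
    """Map each term to the set of identifiers of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Identifiers of the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


DOCUMENTS = {
    1: "The Linux kernel",
    2: "Your guide to the Lucene index",
    3: "Linux and Lucene together",
}

if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    print(sorted(search(index, "lucene linux")))
```

Even this toy version shows where the effort goes: tokenisation rules, stop words and query semantics all have to be decided and maintained by hand.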
Is it possible to do it yourself? (2/2)
• Possible improvements:
  • Stemming (lemmatization) of terms.
     • Each term is replaced by its canonical form.
         – Example: study, studies, studying, studied, ... => "study".
     • Several open source implementations of the Porter algorithm
       exist (e.g. Snowball).
  • Association of the terms or stems with a phonetic
    form (Soundex, Metaphone, etc.).
  • Warning: stemming and phonetic forms depend on
    the language.
     • Automatic language detection may help (e.g. in PHP:
       Text_LanguageDetect in PEAR).
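To make the phonetic form concrete, here is a sketch of the common American Soundex variant in Python. It is written for this document as an illustration, not taken from a library, and it only knows unaccented Latin letters, which also illustrates the warning above: the coding table reflects English pronunciation.

```python
import string

# Consonant groups of American Soundex; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft"):
        print(name, soundex(name))
```

Storing this code next to each term lets a search for "Rupert" also retrieve "Robert", at the price of more false positives.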
Fulltext standard solutions for DBMS
What exists?

• SQLite: no fulltext support as standard (?), but the SQLite
  FTS3 extension exists (untested).
• Firebird (untested): no standard module, but extensions
  (including Sphinx Search).
• MySQL: standard MySQL FULLTEXT module and
  extensions (including Sphinx Search).
• PostgreSQL: standard tsearch module
  (standard since v8.3).
• Other: Senna (untested).
  • Tritonn = MySQL + Senna | Ludia = PostgreSQL + Senna.
  • Cf. Kazuhiko Shiozaki (Solutions Linux 2008).
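For SQLite, the FTS3 extension mentioned above can be tried from Python's sqlite3 module when the underlying SQLite library was compiled with it. The sketch below assumes that this is the case and returns None otherwise; the `news` table, the sample rows and the function name `fts_search` are hypothetical.

```python
import sqlite3

ROWS = [
    ("Linux kernel", "A new kernel was released"),
    ("Lucene", "An indexer for Java, also runs on Linux"),
    ("Cooking", "Pasta recipes"),
]


def fts_search(rows, query):
    """Index (title, body) rows in an FTS3 table and return matching titles.

    Returns None when the SQLite build has no FTS3 support.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE news USING fts3(title, body)")
    except sqlite3.OperationalError:
        return None  # FTS3 module not available in this build
    conn.executemany("INSERT INTO news (title, body) VALUES (?, ?)", rows)
    # Matching against the table name searches all indexed columns.
    cursor = conn.execute(
        "SELECT title FROM news WHERE news MATCH ? ORDER BY docid", (query,)
    )
    return [title for (title,) in cursor]


if __name__ == "__main__":
    print(fts_search(ROWS, "linux"))
```

The MATCH operator uses the fulltext index instead of scanning the rows, which is the main difference from the LIKE and user-defined function approaches.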
MySQL fulltext (1/2)

• MySQL provides an automatic fulltext mode.
  • Creation:
     • CREATE TABLE news ( id INT UNSIGNED AUTO_INCREMENT
       NOT NULL PRIMARY KEY, title VARCHAR(256), body TEXT,
       FULLTEXT (title, body) )
  • Selection:
     • SELECT id, title, body, MATCH (title,body) AGAINST ('linux')
       AS score FROM news WHERE MATCH (title,body) AGAINST
       ('linux') ORDER BY score DESC
MySQL Fulltext (2/2)

• Strengths:
  • Supported on most shared web hosting services,
  • Handles dictionary creation and query
    parsing,
  • Availability of search operators,
  • Computation of a relevance score,
  • Query expansion mechanism, ...
• Weaknesses:
  • No control over text analysis (tokenisation but no
    stemming).
  • Minimum token (term) length set to 4 characters by
    default (cannot be changed on a shared web hosting service).
MySQL with Sphinx Search

• The Sphinx Search extension must be compiled for
  MySQL.
• Also supports PostgreSQL.
• Used by craigslist.org, mininova.org, ...
• Strengths: supports very large volumes of data (>
  100 GB of text), storage still provided by
  MySQL, portable.
• Weaknesses: stemming limited to English and
  Russian.
PostgreSQL tsearch (1/2)

• PostgreSQL offers an automatic fulltext mode.
  • Creation:
     • ALTER TABLE tpages ADD COLUMN vecteur tsvector;
       UPDATE tpages SET vecteur=to_tsvector(contenu);
  • Selection:
     • SELECT * FROM tpages WHERE vecteur @@
       to_tsquery('tsearch2');
• Advanced Features:
  • Score of relevance with ts_rank.
  • Creation of "snippet" with ts_headline.
PostgreSQL tsearch (2/2)

• Strengths: query parsing, language-dependent
  stemming.
• Weakness: not supported on shared web hosting
  services.
Indexers
What exists?

• Lucene (and its multiple ports),
• Other: Xapian,...
Xapian

• Fork of Open Muscat (BrightStation PLC); C++,
  GPL.
• Strengths: many bindings, import filters
  (extraction), stemming (many languages
  supported), synonyms (query expansion),
  query spelling correction, support for indexing SQL
  databases (MySQL, PostgreSQL, SQLite, Oracle,
  DB2, MS SQL, LDAP and ODBC).
• Weakness: less popular.
Lucene (1/3)

• Supported by the Apache Software Foundation.
• Wide ecosystem:
  • Used in Alfresco, Jahia, ...
  • Multiple integrations (e.g. CouchDB-lucene).
  • Many third-party tools: Luke (index reader), Solr
    (search server; without crawler), Nutch (search
    engine with crawler), Carrot² (search interface
    compatible with OpenSearch and Solr), ...
• The Lucene index format is becoming a de facto standard.
Lucene (2/3)

• Many ports (Perl, Python, .NET, ...).
   • Lucene.Net (.NET), PyLucene (Python), CLucene
     (C++), Plucene (Perl), Zend Search (PHP).
   • Warning: feature coverage and supported index
     format version vary from port to port!
• Three types of port:
   • literal translation (API compatible),
   • translation adapted to the target language (better
     performance),
   • binding (for Python).
Lucene (3/3)

• Ability to change the text analyzers.
• Access to the dictionary of terms.
• Multiple search operators (AND, OR, NOT, +, -, ?,
  *, ...).
• Exact or fuzzy search, management of
  synonyms, ...
• Ability to search by field (e.g. title:linux).
• Ability to sort by field.
Some tests
What was tested?

• Criteria taken into account:
  •   the speed of index creation,
  •   the speed of insertion,
  •   the speed of removal,
  •   the speed of search.
• No systematic evaluation of relevance.
• Two data sets:
  • 20,000 text documents from 1 kB to 900 kB,
  • 200,000 text documents from 2 kB to 5 kB.
MySQL, PostgreSQL, Sphinx Search (2008) (1/2)
• MySQL:
  • smaller index,
  • slower search compared to PostgreSQL or Sphinx
    Search,
  • deletion is slow
  • Insertion is very slow with bigger data.
• PostgreSQL:
  • index creation is slow,
  • insertion is very slow with bigger data.
MySQL, PostgreSQL, Sphinx Search (2008) (2/2)
• Sphinx Search (with MySQL):
  • manual (re)indexing (but very fast),
  • fast searches,
  • relatively insensitive to data size.
Xapian, Lucene, PyLucene, Lucene.Net (2008)
• Xapian:
  • Slow when creating or updating the index, large
    index (compared to Lucene);
  • Installation more difficult.
• Lucene:
  • Performance fairly homogeneous (Lucene, PyLucene
    and Lucene.Net);
  • PyLucene significantly slower in creating and
    updating of the index (why?).
Zend Search (2010)

• PHP technology built into the Zend Framework.
• Easy to host.
• Very useful for small volumes of data.
• Fragile index (corruption under heavy
  insertion load).
• Example: www.retronimo.com (RSS search).
Discussion and conclusion
Which technology to choose? (1/2)

• Database or indexer?
  • Indexer if the data is purely textual.
  • Database if:
     • the data is structured,
     • a relational model is needed,
     • the SQL language is needed.
Which technology to choose? (2/2)

• Databases:
  • MySQL: well suited for basic solutions (average relevance,
    good performance on small data sizes), easy to host,
    part of the LAMP / MAMP / WAMP platforms.
  • PostgreSQL: well suited for professional solutions (but
    avoid it with larger data).
  • Sphinx Search: suitable for large volumes of data, whatever
    the size of the documents.
• Indexers:
  • Lucene confirms its reputation as the reference.
  • Zend Search is only useful for small volumes of data.
Thanks!



Thank you for your attention.

         Questions?
Some resources
Tools (1/2)

•   SQLite (www.sqlite.org).
•   MySQL (www.mysql.com).
•   WampServer (www.wampserver.com).
•   Sphinx Search (www.sphinxsearch.com).
•   PostgreSQL (www.postgresql.org).
•   Tritonn (qwik.jp/tritonn/).
•   Lucene (lucene.apache.org).
•   Zend framework (framework.zend.com).
•   Xapian (xapian.org).
Tools (2/2)

•   Solr (lucene.apache.org/solr/).
•   Carrot² (project.carrot2.org).
•   Luke (www.getopt.org/luke/).
•   Nutch (nutch.apache.org).
•   Tesseract (tesseract-ocr.googlecode.com).
•   Apache POI (poi.apache.org).
•   Snowball (snowball.tartarus.org).
•   Reverend (divmod.org/trac/wiki/DivmodReverend).
Resources and useful links (1/3)

• Justine Demeyer (intern), Robert Viseur (internship supervisor) and
  Tom Mens (internship director) (2008). Comparaison de
  technologies d'indexation fulltext. UMons / CETIC.
• Robert Viseur (2008). "Solutions Linux: session sur l'indexation
  fulltext dans les SGBD". URL:
  http://www.robertviseur.be/news-20080222.php .
• Robert Viseur (2008). "Atelier de présentation du mode
  FULLTEXT de PostgreSQL 8.3 aux RMLL 2008". URL:
  http://www.robertviseur.be/news-20080728.php .
• Robert Viseur (2009). "Première comparaison de Tesseract,
  OCRAD, GOCR et... PhpOCR". URL:
  http://www.robertviseur.be/news-20080726.php .
Resources and useful links (2/3)

• Erik Hatcher and Otis Gospodnetić (2004). "Lucene in
  Action". Manning Publications Co.
• "Annexe F. Expressions régulières MySQL". URL:
  http://dev.mysql.com/doc/refman/5.0/fr/regexp.html .
• "9.7. Pattern Matching". URL:
  http://www.regular-expressions.info/postgresql.html .
• Philippe Makowski (2009). "Firebird 2.5, les principales
  nouveautés". Code way 3, 16-20 November 2009. URL:
  http://www.firebirdsql.org/download/rabbits/pmakowski/firebird-25.pdf .
• "Does Firebird support full-text search?". URL:
  http://www.firebirdfaq.org/faq328/ .
Resources and useful links (3/3)

• Björn Reimer & Dirk Baumeister (2006). "Full text search in Firebird without a full
  text search engine". Firebird Conference Prague 2006. URL:
  http://www.ibphoenix.com/downloads/FirebirdConf2006/TECH-TPZ303-R/TECH-TPZ303-R.zip .
• "SQLite FTS3 Extension". URL: http://www.sqlite.org/fts3.html .
• "Full-Text Search on SQLite". URL:
  http://michaeltrier.com/2008/7/13/full-text-search-on-sqlite .
• "Tritonn - MySQL with Senna". Sumisho Computer Systems Corporation Brazil, Inc.
  URL: http://qwik.jp/tritonn/about_en.files/tritonn-eng.pdf .
• Kazuhiko Shiozaki (2008). "Moteurs plein texte sous MySQL et PostgreSQL pour la
  gestion de connaissances". Solutions Linux, 2008. URL:
  http://www.robertviseur.be/news-20080222.php .
Contact

• Ir. Robert Viseur.
• Email : robert.viseur@cetic.be
• Phone : 0032 (0) 479 66 08 76

Introduction to libre « fulltext » technology

  • 1. [ Rencontres Mondiales du Logiciel Libre 2010 - Thursday, July 8 ] Introduction to libre « fulltext » technology Author : Ir Robert Viseur
  • 2. Who am I? • My name : Robert Viseur • Civil Engineer, Master in Management of Innovation. • Specific expertise in the economics of free software and practices of co-creation. • Manager of logiciellibre.com (directory of free software companies). • Assistant in the Department of Economics and Management of Innovation (mi.fpms.ac.be) of the Faculty of Engineering, University of Mons (www.umons.ac.be). • Technology advisor at CETIC (www.cetic.be). • Belgian ICT Research and Technology Transfer Centre. • Initiator of Cellavi project (Center of Expertise for Open Source use in Industrial Applications).
  • 4. What do we talk about? • Limits of conventional DBMS for text searching. • Technologies for fulltext in DBMS (MySQL fulltext, PostgreSQL tsearch, Sphinx Search for MySQL). • Free indexers (Lucene family, Xapian). • We will not speak about NoSQL databases (Cassandra, CouchDB, HBase, ...).
  • 5. Why is this useful? • Searching for articles in a CMS (Content Management System), for posts in a forum or for items in an online shop. • Searching news, podcasts or RSS / Atom feeds. • Searching the content of books or PDF papers. • ...
  • 7. What are the important steps? • Four important steps: • extraction, • indexing, • search, • presentation of results.
  • 8. Step 1 : the extraction (1/2) • Conversion of the file to index in plain text. • Simple cases : • structured text files • Examples: – XML (with PHP::SimpleXML), – RSS (with PHP::SimplePie), – HTML (with PHP::strip_tags or HTML analyzer). • documented complex formats • Examples: ODF = XML compressed file (ZIP).
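The slide names PHP's strip_tags or an HTML analyzer for the HTML case. As a rough illustration of the same extraction step, here is a minimal Python sketch built on the standard-library html.parser module. The class and function names (TextExtractor, html_to_text) are hypothetical, and skipping script and style content is an assumption about what should not be indexed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join(" ".join(parser.parts).split())
```

The resulting plain text is what would be handed to the indexing step.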
  • 9. Step 1 : the extraction (2/2) • Complex cases: undocumented binary formats. • Example: Office formats (97, 2000, XP, ...) • Use of Open Source projects: • Apache Jakarta POI (MS Office), Apache Tika (various documents), xls2csv (Microsoft Excel), catdoc (Microsoft Word), pdfinfo (PDF), ... • Extraction often imperfect (~ 20% error with POI). • Use IFilters (MS Windows). • Extensions proposed by the publishers themselves to extract the contents of files (Microsoft Office, Autocad, etc.). • Scanned documents: OCR Open Source solutions. • See Tesseract, OCRAD and GOCR (still emerging).
  • 10. Steps 2 & 3: indexing and search • What everybody knows: SELECT ... WHERE ... LIKE ... • What is less known: regular expressions. • What you may be tempted to do: « do it yourself ». • What is recommended instead: use standard technologies.
  • 11. Step 4: presentation • Export results to XML (API: OpenSearch standard format) • XSL transformations to RSS, Atom, ... • Spellchecking (e.g. with Aspell or with algorithms like Soundex or Levenshtein distance) • Classification of results (e.g. with Reverend).
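The Levenshtein distance mentioned for spellchecking can be sketched in a few lines of Python. This is only an illustration of the idea ("did you mean ...?" against the dictionary of indexed terms); the function names and the max_distance threshold of 2 are assumptions, not something the slides specify.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return the closest term of a non-empty dictionary, or None if too far."""
    best = min(dictionary, key=lambda term: levenshtein(word, term))
    return best if levenshtein(word, best) <= max_distance else None
```

For example, suggest("linxu", ["linux", "lucene", "xapian"]) proposes "linux". The scan is linear in the dictionary size, so this only suits small vocabularies.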
  • 13. What happens if there are no fulltext solutions ? (1/2) • This is particularly the case in SQLite. • The best-known approach: LIKE • Example: SELECT news.title, news.url FROM news WHERE news.title LIKE '%linux%' • Possible improvements: splitting the query into tokens, regular expression filtering, ... • Disadvantages: poor relevance, not suitable for large volumes of data.
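The "splitting the query into tokens" improvement can be sketched as follows with Python's sqlite3 module: each token becomes its own LIKE condition, and all conditions are combined with AND. The news table comes from the slide; the helper names, the sample rows and the example.org URLs are hypothetical.

```python
import sqlite3


def escape_like(token):
    """Escape the LIKE wildcards so that user input is matched literally."""
    return token.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def search_like(conn, query):
    """Return (title, url) rows whose title contains every token of the query."""
    tokens = query.split()
    if not tokens:
        return []
    where = " AND ".join(["news.title LIKE ? ESCAPE '\\'"] * len(tokens))
    params = ["%" + escape_like(t) + "%" for t in tokens]
    sql = ("SELECT news.title, news.url FROM news WHERE " + where
           + " ORDER BY news.title")
    return conn.execute(sql, params).fetchall()


def make_demo_db():
    """Build a small in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE news (title TEXT, url TEXT)")
    conn.executemany("INSERT INTO news VALUES (?, ?)", [
        ("Linux kernel released", "http://example.org/1"),
        ("New kernel for BSD", "http://example.org/2"),
        ("Linux desktop news", "http://example.org/3"),
    ])
    return conn
```

Every condition still forces a full scan of the table and there is no relevance score, which is the disadvantage the slide points out.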
  • 14. What happens if there are no fulltext solutions ? (2/2) • User-defined functions or regular expressions in SQL: • Example (PHP, SQLite): • sqlite_create_function($db, 'sqlite_fulltext', 'sqlite_fulltext', 2); • $sql = "SELECT * FROM torrents WHERE sqlite_fulltext(Search, '" . sqlite_escape_string($q) . "') == 1 ORDER BY Title"; • In "sqlite_fulltext", using: preg_match("/\b($word)\b/i", $text). • Disadvantage: not suitable for large volumes of data. • Regular expressions are also supported by MySQL, PostgreSQL and Firebird (since 2.5).
  • 15. Is it possible to do it yourself? (1/2) • Yes, but the development is difficult. • Create the dictionary: • Filter the text by removing non-alphanumeric characters. • Split the filtered text into "terms" (tokens). • Remove stop words (the, a, my, your, their, our, ...). • Build a correspondence table (identifier of the document, term). – Each term is associated with the list of documents containing that term. • Write the right SQL queries...
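The dictionary-building steps above can be sketched as a small in-memory inverted index in Python. This is a simplified illustration, not the author's implementation: the stop-word list is a tiny made-up sample, the tokenizer only keeps unaccented ASCII letters and digits, and a dict stands in for the SQL correspondence table.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop-word list, not a real one.
STOP_WORDS = {"the", "a", "an", "of", "my", "your", "their", "our"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in terms if t not in STOP_WORDS]


def build_index(documents):
    """Map each term to the set of identifiers of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Identifiers of the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A search then reduces to set intersections on the term lists, which is what the SQL queries on the (document, term) table would compute.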
  • 16. Is it possible to do it yourself? (2/2) • Possible improvements: • Lemmatization of terms. • Each term is replaced by its canonical form. – Example: studying, student, students, ... => "study". • Several open source implementations of the Porter algorithm (e.g. Snowball). • Association of the terms or lemmas with a phonetic form (Soundex, Metaphone, etc.). • Warning: stemming and phonetic forms depend on the language. • Automatic language detection (e.g. in PHP: Text_LanguageDetect in PEAR).
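As an illustration of a phonetic form, here is a minimal Python sketch of the classic American Soundex code. It is written for English names and ignores accented letters, which is exactly the language dependence the slide warns about; the function name and the handling of empty input are assumptions.

```python
# Letter groups of American Soundex; vowels, y, h and w have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no a-z letter)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
        last = code  # a vowel resets the previous code
        if len(result) == 4:
            break
    return result.ljust(4, "0")
```

Storing the code next to each term lets a query for "Rupert" also reach documents that contain "Robert", since both map to the same code.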
  • 18. What exists? • SQLite: not included as standard (?) but SQLite FTS3 extension (untested). • Firebird (untested): no standard module but extensions (including Sphinx Search). • MySQL: MySQL FULLTEXT standard module and extensions (including Sphinx Search). • PostgreSQL: PostgreSQL standard module tsearch (standard since v8.3). • Other: Senna (untested). • Tritonn = MySQL + Senna | Ludia = PostgreSQL + Senna. • Cf. Kazuhiko Shiozaki (Solutions Linux 2008).
  • 19. MySQL fulltext (1/2) • MySQL provides an automatic fulltext mode. • Creation: • CREATE TABLE news ( id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR(256), body TEXT, FULLTEXT (title, body) ) • Selection: • SELECT id, title, body, MATCH (title,body) AGAINST ('linux') AS score FROM news WHERE MATCH (title,body) AGAINST ('linux') ORDER BY score DESC
  • 20. MySQL Fulltext (2/2) • Strengths: • Supported on most shared web hosting services; • Handles the creation of the dictionary and the parsing of the query; • Availability of search operators; • Evaluation of a relevance score; • Mechanism for query expansion, ... • Weaknesses: • No control over the analysis of the text (tokenisation but no stemming); • Minimum size of tokens (terms) set by default to 4 characters (not editable on a shared web hosting service).
  • 21. MySQL with Sphinx Search • Sphinx Search extension must be compiled for MySQL. • Support for PostgreSQL. • Used by craigslist.org, mininova.org, ... • Strengths: supports very large volumes of data (> 100 GB of text), storage still handled by MySQL, portable. • Weaknesses: stemming limited to English and Russian.
  • 22. PostgreSQL tsearch (1/2) • PostgreSQL offers an automatic fulltext mode. • Creation: • ALTER TABLE tpages ADD COLUMN vecteur tsvector; UPDATE tpages SET vecteur=to_tsvector(contenu); • Selection: • SELECT * FROM tpages WHERE vecteur @@ to_tsquery('tsearch2'); • Advanced Features: • Relevance score with ts_rank. • Creation of "snippets" with ts_headline.
  • 23. PostgreSQL tsearch (2/2) • Strengths: query parsing, language-dependent stemming. • Weakness: not supported on shared web hosting services.
  • 25. What exists? • Lucene (and its multiple ports), • Other: Xapian,...
  • 26. Xapian • Fork of Open Muscat (BrightStation PLC); C++, GPL. • Strengths: many bindings, import filters (extraction), stemming (many languages supported), synonyms (query expansion), query correction, support for indexing SQL databases (MySQL, PostgreSQL, SQLite, Oracle, DB2, MS SQL, LDAP and ODBC). • Weaknesses: less popular.
  • 27. Lucene (1/3) • Supported by the Apache Foundation. • Wide ecosystem: • Used in Alfresco, Jahia ... • Multiple integrations (e.g. CouchDB-lucene). • Many third-party tools: Luke (index reader), Solr (search server; without crawler), Nutch (search engine with crawler), Carrot² (search interface compatible with OpenSearch and Solr), ... • The Lucene index format has become a kind of standard.
  • 28. Lucene (2/3) • Many ports (Perl, Python, .Net, ...). • Lucene.Net (.Net), PyLucene (Python), CLucene (C++), Plucene (Perl), Zend Search (PHP). • Warning: check the functional coverage and the version of the index format supported! • Three types of port: • by literal translation (API compatible), • by translation adapted to the target language (best performance), • by binding (for Python).
  • 29. Lucene (3/3) • Ability to change the text analyzers. • Access to the dictionary of terms. • Multiple search operators (AND, OR, NOT, +, -, ?, *, ...). • Exact or fuzzy search, management of synonyms, ... • Ability to search by fields (eg title: linux). • Ability to sort by field.
  • 31. What was tested? • Taking into account: • the speed of index creation, • the speed of insertion, • the speed of removal, • the speed of search. • No systematic consideration of relevance. • Two sets of data: • 20,000 text documents from 1kB to 900kB, • 200,000 text documents from 2kB to 5kB.
  • 32. MySQL, PostgreSQL, Sphinx Search (2008) (1/2) • MySQL: • smaller index, • slower search compared to PostgreSQL or Sphinx Search, • deletion is slow • Insertion is very slow with bigger data. • PostgreSQL: • index creation is slow, • insertion is very slow with bigger data.
  • 33. MySQL, PostgreSQL, Sphinx Search (2008) (2/2) • Sphinx Search (with MySQL) : • manual (re)indexation (but very fast), • fast searches, • relatively insensitive to data size.
  • 34. Xapian, Lucene, PyLucene, Lucene.Net (2008) • Xapian: • Slow when creating or updating the index, large index (compared to Lucene); • Installation more difficult. • Lucene: • Performance fairly homogeneous (Lucene, PyLucene and Lucene.Net); • PyLucene significantly slower in creating and updating of the index (why?).
  • 35. Zend Search (2010) • PHP technology built into the Zend framework. • Easily hostable. • Very useful for small volumes of data. • Fragile index (corruption under heavy insertion load). • Example: www.retronimo.com (search RSS).
  • 37. Which technology to choose? (1/2) • Database or index? • Indexer if the data is purely textual. • Database if: • Structured data, • Need for a relational model, • Need for the SQL language.
  • 38. Which technology to choose? (2/2) • Databases: • MySQL: well suited for basic solutions (average relevance, good performance on small data sizes), easily hostable, integrated in LAMP / MAMP / WAMP platforms. • PostgreSQL: well suited for professional solutions (but avoid with bigger data). • Sphinx Search: suitable for large volumes of data, whatever the size of the documents. • Indexers: • Lucene confirms its reputation as a reference. • Zend Search only useful for small volumes of data.
  • 39. Thanks! Thank you for your attention. Questions?
  • 41. Tools (1/2) • SQLite (www.sqlite.org). • MySQL (www.mysql.com). • WampServer (www.wampserver.com). • Sphinx Search (www.sphinxsearch.com). • PostgreSQL (www.postgresql.org). • Tritonn (qwik.jp/tritonn/). • Lucene (lucene.apache.org). • Zend framework (framework.zend.com). • Xapian (xapian.org).
  • 42. Tools (2/2) • SolR (lucene.apache.org/solr/). • Carrot² (project.carrot2.org). • Luke (www.getopt.org/luke/). • Nutch (nutch.apache.org). • Tesseract (tesseract-ocr.googlecode.com). • Apache POI (poi.apache.org). • Snowball (snowball.tartarus.org). • Reverend (divmod.org/trac/wiki/DivmodReverend).
  • 43. Resources and useful links (1/3) • Justine Demeyer (stagiaire), Robert Viseur (maître de stage) et Tom Mens (directeur de stage) (2008). Comparaison de technologies d'indexation fulltext. UMons / CETIC, 2008. • Robert Viseur (2008). "Solutions Linux: session sur l'indexation fulltext dans les SGBD". URL: http://www.robertviseur.be/news-20080222.php . • Robert Viseur (2008). "Atelier de présentation du mode FULLTEXT de PostgreSQL 8.3 aux RMLL 2008". URL: http://www.robertviseur.be/news-20080728.php . • Robert Viseur (2009). "Première comparaison de Tesseract, OCRAD, GOCR et... PhpOCR". URL: http://www.robertviseur.be/news-20080726.php .
  • 44. Resources and useful links (2/3) • Erik Hatcher et Otis Gospodnetić (2004). "Lucene in Action". Manning Publications Co. • "Annexe F. Expressions régulières MySQL". URL: http://dev.mysql.com/doc/refman/5.0/fr/regexp.html . • "9.7. Pattern Matching". URL: http://www.regular-expressions.info/postgresql.html . • Philippe Makowski (2009). "Firebird 2.5, les principales nouveautés". Code way 3. 16-20 novembre 2009. URL: http://www.firebirdsql.org/download/rabbits/pmakowski/firebird-25.pdf . • "Does Firebird support full-text search?". URL: http://www.firebirdfaq.org/faq328/ .
  • 45. Resources and useful links (3/3) • Björn Reimer & Dirk Baumeister (2006). "Full text search in Firebird without a full text search engine". Firebird Conference Prague 2006. URL: http://www.ibphoenix.com/downloads/FirebirdConf2006/TECH-TPZ303-R/TECH-TPZ303-R.zip . • "SQLite FTS3 Extension". URL: http://www.sqlite.org/fts3.html . • "Full-Text Search on SQLite". URL: http://michaeltrier.com/2008/7/13/full-text-search-on-sqlite . • "Tritonn - MySQL with Senna". Sumisho Computer Systems Corporation Brazil, Inc. URL: http://qwik.jp/tritonn/about_en.files/tritonn-eng.pdf . • Kazuhiko Shiozaki (2008). "Moteurs plein texte sous MySQL et PostgreSQL pour la gestion de connaissances". Solutions Linux, 2008. URL: http://www.robertviseur.be/news-20080222.php .
  • 46. Contact • Ir. Robert Viseur. • Email : robert.viseur@cetic.be • Phone : 0032 (0) 479 66 08 76