SlideShare a Scribd company logo
1 of 47
Download to read offline
Patrick Beaucamp
Founder of the Vanilla Project
Mail : Patrick.beaucamp@bpm-conseil.com
Custom Open Source Search Engine with Drupal 8
and Solr at French Ministry of Environment
II-SDV, Nice 24th April 2017
1II-SDV, Nice
Presentation Agenda
Open Source Search Engine & Search Platform
Some interesting Platforms
Features expected for Search Platforms (Interface)
2II-SDV, Nice
Open Source Platform at French Ministry
Project Context
Platform Architecture
WebSite Powered by a Search engine
Echo : Tuesday am, presentation from Deep Search 9 and
Tuesday pm prssentation from FranceLabs
Personal Experience of Search
Searching … and finding !
II-SDV : SEARCH, DATA MINING and
VISUALISATION
3II-SDV, Nice
How many times per day do you Google ? (search,
maps, translate …)
Tribute to Open Source at II-SDV
Search is the first Step : collecting information
Searching … and finding !
4II-SDV, Nice
Searching … and finding !
An exemple – my personal experience
5II-SDV, Nice
I tried to find a person during 23 years, roughly from 1993
to 2016
From 1993 to 1998 : no search engine available …
only private investigator ?
From 1999 to 2015 : regular Search – no results
I founded this person on facebook, not on google
From a browser : « f + tab » … « g + tab », « y + tab » …
Some years : no search, other years : multiples search
Searching … and finding !
6II-SDV, Nice
1) We all became private investigators one day or another
Searching … and finding !
7II-SDV, Nice
Searching … and finding !
8II-SDV, Nice
2) Different search engine lead to different results
Searching … and finding !
9II-SDV, Nice
2) Different search engine by country
Searching … and finding !
10II-SDV, Nice
Funny word : SEO … its more « how to be found on
Internet » … and you need to pay for it !
Searching … and finding !
11II-SDV, Nice
3) The person I was looking published on facebook using
his/her real name – its his/her decision to be visible or not
4) Where do we stand with the « Right to Forget »
Searching … and finding !
12II-SDV, Nice
Companies like Facebook have tons of data : they need to
provide search infrastructure (indexing + search interface)
I was lucky to make a try with facebook search interface
Searching … and finding !
13II-SDV, Nice
Discovery of Cholera – 1854 (John Snow)
http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
Searching … and finding !
14II-SDV, Nice
Bicycle Accident in Street : who is taking care of trafic management
Example in Boston :
http://www.boston.com/bostonglobe/editorial_opinion/blogs/the_angle/2010/12/bike_crash_map.html
Open Data
Searching … and finding !
15II-SDV, Nice
LION – 2016 (Garth Davis)
Mistake 1 : Ganesh Tanei – Mistake 2 : Saroo
OpenSource LandScape
16II-SDV, Nice
Crawling
Indexing
Storing
WebSite
Reference
WebSite
Accessibility
Update Management
Search Interface
Result Visualization
Auto Completion
Natural Language
Voice Recognition
Maps
Ads
Unstructured data
Access Management
Search Platform Objectives
Constraints : being able to reach WebSite and content :
Internal WebSites (Intranet) & External WebSites
Internal Document Repositories
17II-SDV, Nice
Being able to index WebSite content (and page updates)
Beeing able to store unstructured data
Crawling
Storing
Indexing
Search Platform Objectives
18II-SDV, Nice
Provide usable Search results (auto classification,
visualization)
Don’t Forget why and what you search :
• You search in existing documents
• You need visualization tools
• Its not a crystal ball : search reflects the past
Provide usable Search interfaces (semantic search, multi
language search …)
Search Interface
Result Visualization
19II-SDV, Nice
Lucene is a java based indexing and search API
Solr/Lucene is the leading server extension of Lucene. 2 companies, LucidWorks
(Fusion) and ElasticSearch, provides packaging and extension of top of Lucene
and Solr.
-Nutch is the crawling component
-Tika is a document Metadata manager – content analysis toolkit
-Zookeeper is a multi thread process manager
OpenSource LandScape
20II-SDV, Nice
-Search Landscape
-Lucene : http://lucene.apache.org
-Solr/Lucene : http://lucene.apache.org/solr/
-Plateform OpenSearch : http://www.open-search-server.com
-Plateform Katta : http://katta.sourceforge.net
-Plateform LucidWorks : http://www.lucidworks.com
-Plateform ElasticSearch : http://www.elasticsearch.com
-Sphinx : http://sphinxsearch.com/
-Cloudera : https://www.cloudera.com/documentation/enterprise/5-5-
x/topics/search_architecture.html
-FranceLabs : http://www.francelabs.com/ (Datafari)
-AklaBox : www.aklabox.com (AklaSearch)
OpenSource LandScape
21II-SDV, Nice
Lucene : Retrieval Software library
Use existing Search Infrastructure like Solr/Lucene (Vanilla certified)
http://www.lucidworks.com/ or http://www.elasticsearch.org/
Search Engine Focus
22II-SDV, Nice
-Cloudera with Solr/Cloud (Solr/Lucene)
-Mapr with ElasticSearch (Lucene code)
-HortonWorks with LucidWorks (Solr/Lucene)
Hadoop Search Platform - Big Data
23II-SDV, Nice
Before indexing your document base, you need to access it !
Apache Nutch is a highly extensible and scalable open source web crawler
software project.
Reference : http://nutch.apache.org/
Nutch
24II-SDV, Nice
Solr
• What is Solr
– Indexation and Search Engine
• Promoted by the Apache Foundation
• Built on Top of Apache Lucene (Java Search library)
– Major engine characteristics
• Scalable, fault tolerance, distribution indexation process, dynamic
workload balancer, centraized configuration
– Technical environment
• Java
• Embeded Jetty server for platform administration
25II-SDV, Nice
Solr
Main characteristics
Admin Interface
Flexible and scalable Configuration
Modular
Multiple index management with a signle instance
26II-SDV, Nice
Solr
Main characteristics
Standard communication interfaces (html, xml, json)
Configuration can be done with or without schema
Real time Indexation
27II-SDV, Nice
Solr
Main characteristics
Customizable Full Text analysis
Rich documents indexation (using Tika)
28II-SDV, Nice
Solr
Main characteristics
Search by facet and filters
Term suggestion and orthograph correction
Geospatial Search
29II-SDV, Nice
Solr
Solr behavior
30II-SDV, Nice
-Synonyms
- It is possible to extend the search to synonyms if they are listed in a
glossary. For example, to find articles containing synonyms to “TV” when
you search with the word TV.
-Metadata
- Dictionary for list of searchable keywords
Search Engine Basic (1/2)
31II-SDV, Nice
-Reserved Words, Protected Words
- Indexing usually uses stemming, which is to reduce words to their root, for
example "Developp" to find items also contain the word when trying to
develop the word development. However, sometimes there are adverse
lemmatizations, indexing under one lemma two words that have no
relation. It is possible to prevent the stemming of words by listing them in
a file protwords.txt.
-StopWords
- The stopwords are meaningless words. A word considered insignificant
will be ignored. Note that some words are insignificant in some contexts,
others have homonyms signifiers. For example, can refer to a summer
season (rather mean) or past participle of the verb to be (relatively
insignificant). Stopwords.txt the file looks like this
Search Engine Basic (2/2)
32II-SDV, Nice
-Multi Language support (this is where commercial search engine have still more
to bring to customer), even there is now Asian type language support (Hindi,
Thai, Chineese, …)
-Elision :
- Elisions are a feature of the French, which consist of a contraction of the
words like or when they are followed by a vowel. Example: + aircraft gives
the aircraft. It is possible to remove these elisions using a lexicon.
-Limits solved other the past 3 years
• Full text search interface (language with search engine)
• SubQuery support : now its ok starting with Solr 4.7 (we are v6)
• Scalability (this is where Solr is taking technical advantage)
Search Engine Current Limits
33II-SDV, Nice
-Advance indexing and querying tools.
-Provides distributed searching capabilities to prevent bottleneck for a particular
server.
-Provides document excerpts (snippets) generation that provides summary of the
search
-Relevance ranking display extracts from the documents based on the query.
Search Interface expectation (1/3)
34II-SDV, Nice
-Duplicate document detection, including fuzzy near duplicates
-Rich Document Parsing and Indexing without using Database Indexing.
-Ranking control carry out a targeted ranking of individual documents.
-Search Grouping by Type / Tag / Categories (General page, documents, images)
Search Interface expectation (2/3)
35II-SDV, Nice
-Multi Criteria support
-Ranking
-Natural language support
-Apps Support (Android, Ipad)
Search Interface expectation (3/3)
Project at Ministry
Initial decision and guidelines from Ministry
36II-SDV, Nice
New WebSite will be done using Drupal CMS 8.2
WebSite should be powered by a « Google alike Search Toolbar »
WebSite – Infrastructure – should connect with multiples other
WebSite
All Infra (Software) must be Open Source components
Project at Ministry
37II-SDV, Nice
http://www.developpement-durable.gouv.fr/
Project at Ministry
38II-SDV, Nice
http://www.developpement-durable.gouv.fr/
Project at Ministry - Architecture
39II-SDV, Nice
Project at Ministry - Architecture
40II-SDV, Nice
Project at Ministry - Technical
41II-SDV, Nice
Projects Steps
Nutch crawler for various WebSite
• Facebook, LinkedIn, Twitter, Youtube …
• Internal WebSite, Previous WebSite
Drupal Forms for Metadata & indexation
• Specific Forms for different kind of documents
• Drupal CMS process to add new content
Drupal 8 Module for Solr : custom search, monitoring, reporting
• Existing drupal solr is limited to single instance of drupal
• Not possible to use Solr Admin interface
Project at Ministry - Technical
42II-SDV, Nice
Additional PHP libraries
Curl : Communication Drupal-Solr (http-get http-post & attached file)
Ssh2 : server administration command
Zookeeper : Communication Drupal-Zookeeper
MemCached : Communication Drupal-Memcached
Solarium : Communication Drupal-Solr (abstraction layer)
GoogleApi : youtube content indexation
Project at Ministry – Admin Interface
43II-SDV, Nice
Drupal8 Addon to setup the global infrastructure (Zookeeper, Solr)
Project at Ministry – Admin Interface
44II-SDV, Nice
Drupal8 Addon to monitor the global infrastructure - Statistics
Project at Ministry - Validation
45II-SDV, Nice
Projects Validation & Deployment
No problems with Zookeeper, Solr, Nutch
Stress tests for the global platform : initial slow down with 10 000
simultaneous connection
Sub-Project : Adressing the Single Point of Failure
Solution : Problems with Drupal & MySql -> MemCached
Project at Ministry - Next
46II-SDV, Nice
Next Steps
Review of WebSite content … new Ministry
New Content to be indexed :
• Other WebSite and Social Content
• New set of document to be added in the repository
47II-SDV, Nice

More Related Content

What's hot

II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
Dr. Haxel Consult
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
Dr. Haxel Consult
 

What's hot (20)

II-SDV 2017: Spotting the Stars in your Galaxy of Patent Data
II-SDV 2017: Spotting the Stars in your Galaxy of Patent DataII-SDV 2017: Spotting the Stars in your Galaxy of Patent Data
II-SDV 2017: Spotting the Stars in your Galaxy of Patent Data
 
ICIC 2017: New product presentation minesoft
ICIC 2017: New product presentation minesoftICIC 2017: New product presentation minesoft
ICIC 2017: New product presentation minesoft
 
II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...II-PIC 2017: Gain insight into technical, legal and business information thro...
II-PIC 2017: Gain insight into technical, legal and business information thro...
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
ICIC 2017: How to effectively monitor Technological Developments in IP
ICIC 2017: How to effectively monitor Technological Developments in IPICIC 2017: How to effectively monitor Technological Developments in IP
ICIC 2017: How to effectively monitor Technological Developments in IP
 
ICIC 2017: New product presentationsLighthouse IP
ICIC 2017: New product presentationsLighthouse IPICIC 2017: New product presentationsLighthouse IP
ICIC 2017: New product presentationsLighthouse IP
 
ICIC 2017: Product presentations FIZ Karlsruhe
ICIC 2017: Product presentations FIZ KarlsruheICIC 2017: Product presentations FIZ Karlsruhe
ICIC 2017: Product presentations FIZ Karlsruhe
 
II-SDV 2016 GRIDLOGICS
II-SDV 2016 GRIDLOGICSII-SDV 2016 GRIDLOGICS
II-SDV 2016 GRIDLOGICS
 
AI-SDV 2020: Using Transformer technology to build an AI based personal News ...
AI-SDV 2020: Using Transformer technology to build an AI based personal News ...AI-SDV 2020: Using Transformer technology to build an AI based personal News ...
AI-SDV 2020: Using Transformer technology to build an AI based personal News ...
 
II-PIC 2017: Product Presentation LexisNexis
II-PIC 2017: Product Presentation LexisNexisII-PIC 2017: Product Presentation LexisNexis
II-PIC 2017: Product Presentation LexisNexis
 
Big Data: Big Issues for IP
Big Data: Big Issues for IPBig Data: Big Issues for IP
Big Data: Big Issues for IP
 
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
ICIC 2017: Building a Linked Data Knowledge Graph for the Scholarly Publishin...
 
II-SDV 2016 Aalt van de Kuilen - The Art of Patent Landscaping
II-SDV 2016 Aalt van de Kuilen - The Art of Patent LandscapingII-SDV 2016 Aalt van de Kuilen - The Art of Patent Landscaping
II-SDV 2016 Aalt van de Kuilen - The Art of Patent Landscaping
 
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
II-PIC 2017: Artificial Intelligence, Machine Learning, And Deep Neural Netwo...
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
II-SDV 2016 VantagePoint
II-SDV 2016 VantagePointII-SDV 2016 VantagePoint
II-SDV 2016 VantagePoint
 
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
II-SDV 2016 Irene Kitsara - Patent Landscape Reports and Other WIPO Activitie...
 
II-SDV 2016 Questel Intellixir
II-SDV 2016 Questel IntellixirII-SDV 2016 Questel Intellixir
II-SDV 2016 Questel Intellixir
 
AI-SDV 2021 - Deep SEARCH 9
AI-SDV 2021 - Deep SEARCH 9AI-SDV 2021 - Deep SEARCH 9
AI-SDV 2021 - Deep SEARCH 9
 
Data Science Application in Business Portfolio & Risk Management
Data Science Application in Business Portfolio & Risk ManagementData Science Application in Business Portfolio & Risk Management
Data Science Application in Business Portfolio & Risk Management
 

Similar to II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - ...
II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - ...II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - ...
II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - ...
Dr. Haxel Consult
 
Introducing the Linked Data Research Centre
Introducing the Linked Data Research CentreIntroducing the Linked Data Research Centre
Introducing the Linked Data Research Centre
Michael Hausenblas
 
The Archives Forum - The National Archives - 02 March 2011
The Archives Forum - The National Archives - 02 March 2011The Archives Forum - The National Archives - 02 March 2011
The Archives Forum - The National Archives - 02 March 2011
David F. Flanders
 
Semantic Web & Information Brokering: Opportunities, Commercialization and Ch...
Semantic Web & Information Brokering: Opportunities, Commercialization and Ch...Semantic Web & Information Brokering: Opportunities, Commercialization and Ch...
Semantic Web & Information Brokering: Opportunities, Commercialization and Ch...
Amit Sheth
 

Similar to II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment (20)

II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - ...
II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - ...II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - ...
II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - ...
 
II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...
II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...
II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at Frenc...
 
Searching tech2
Searching tech2Searching tech2
Searching tech2
 
Introducing the Linked Data Research Centre
Introducing the Linked Data Research CentreIntroducing the Linked Data Research Centre
Introducing the Linked Data Research Centre
 
160606 data lifecycle project outline
160606 data lifecycle project outline160606 data lifecycle project outline
160606 data lifecycle project outline
 
Web3.0 or The semantic web
Web3.0 or The semantic webWeb3.0 or The semantic web
Web3.0 or The semantic web
 
Use of Open Data in Hong Kong
Use of Open Data in Hong KongUse of Open Data in Hong Kong
Use of Open Data in Hong Kong
 
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ Deloitte
 
The Archives Forum - The National Archives - 02 March 2011
The Archives Forum - The National Archives - 02 March 2011The Archives Forum - The National Archives - 02 March 2011
The Archives Forum - The National Archives - 02 March 2011
 
DataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
DataPortability and Me: Introducing SIOC, FOAF and the Semantic WebDataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
DataPortability and Me: Introducing SIOC, FOAF and the Semantic Web
 
(Re-) Discovering Lost Web Pages
(Re-) Discovering Lost Web Pages(Re-) Discovering Lost Web Pages
(Re-) Discovering Lost Web Pages
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search Engine
 
Go open2010 sde_20100417
Go open2010 sde_20100417Go open2010 sde_20100417
Go open2010 sde_20100417
 
2.0 Watch
2.0 Watch2.0 Watch
2.0 Watch
 
Mythology of search engine
Mythology of search engineMythology of search engine
Mythology of search engine
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch Seminar
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
Semantic Web & Information Brokering: Opportunities, Commercialization and Ch...
Semantic Web & Information Brokering: Opportunities, Commercialization and Ch...Semantic Web & Information Brokering: Opportunities, Commercialization and Ch...
Semantic Web & Information Brokering: Opportunities, Commercialization and Ch...
 
What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us?
 
lodlam summit session browsable linked data
lodlam summit session browsable linked datalodlam summit session browsable linked data
lodlam summit session browsable linked data
 

More from Dr. Haxel Consult

AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
Dr. Haxel Consult
 
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
Dr. Haxel Consult
 
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
Dr. Haxel Consult
 

More from Dr. Haxel Consult (20)

AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering ManagementAI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
 
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
 
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
 
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
 
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
 
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...
 
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
 
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
 
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
 
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
 
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
 
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
 
AI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance CenterAI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance Center
 
AI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IPAI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IP
 
AI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCAI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOC
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
 
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
 

Recently uploaded

哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
ydyuyu
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
pxcywzqs
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Monica Sydney
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
ayvbos
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
ydyuyu
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Monica Sydney
 

Recently uploaded (20)

哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.
 
Power point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria IuzzolinoPower point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria Iuzzolino
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 

II-SDV 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

  • 1. Patrick Beaucamp Founder of the Vanilla Project Mail : Patrick.beaucamp@bpm-conseil.com Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment II-SDV, Nice 24th April 2017 1II-SDV, Nice
  • 2. Presentation Agenda Open Source Search Engine & Search Platform Some interesting Platforms Features expected for Search Platforms (Interface) 2II-SDV, Nice Open Source Platform at French Ministry Project Context Platform Architecture WebSite Powered by a Search engine Echo : Tuesday am, presentation from Deep Search 9 and Tuesday pm prssentation from FranceLabs Personal Experience of Search
  • 3. Searching … and finding ! II-SDV : SEARCH, DATA MINING and VISUALISATION 3II-SDV, Nice How many times per day do you Google ? (search, maps, translate …) Tribute to Open Source at II-SDV Search is the first Step : collecting information
  • 4. Searching … and finding ! 4II-SDV, Nice
  • 5. Searching … and finding ! An exemple – my personal experience 5II-SDV, Nice I tried to find a person during 23 years, roughly from 1993 to 2016 From 1993 to 1998 : no search engine available … only private investigator ? From 1999 to 2015 : regular Search – no results I founded this person on facebook, not on google From a browser : « f + tab » … « g + tab », « y + tab » … Some years : no search, other years : multiples search
  • 6. Searching … and finding ! 6II-SDV, Nice 1) We all became private investigators one day or another
  • 7. Searching … and finding ! 7II-SDV, Nice
  • 8. Searching … and finding ! 8II-SDV, Nice 2) Different search engine lead to different results
  • 9. Searching … and finding ! 9II-SDV, Nice 2) Different search engine by country
  • 10. Searching … and finding ! 10II-SDV, Nice Funny word : SEO … its more « how to be found on Internet » … and you need to pay for it !
  • 11. Searching … and finding ! 11II-SDV, Nice 3) The person I was looking published on facebook using his/her real name – its his/her decision to be visible or not 4) Where do we stand with the « Right to Forget »
  • 12. Searching … and finding ! 12II-SDV, Nice Companies like Facebook have tons of data : they need to provide search infrastructure (indexing + search interface) I was lucky to make a try with facebook search interface
  • 13. Searching … and finding ! 13II-SDV, Nice Discovery of Cholera – 1854 (John Snow) http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
  • 14. Searching … and finding ! 14II-SDV, Nice Bicycle Accident in Street : who is taking care of trafic management Example in Boston : http://www.boston.com/bostonglobe/editorial_opinion/blogs/the_angle/2010/12/bike_crash_map.html Open Data
  • 15. Searching … and finding ! 15II-SDV, Nice LION – 2016 (Garth Davis) Mistake 1 : Ganesh Tanei – Mistake 2 : Saroo
  • 16. OpenSource LandScape 16II-SDV, Nice Crawling Indexing Storing WebSite Reference WebSite Accessibility Update Management Search Interface Result Visualization Auto Completion Natural Language Voice Recognition Maps Ads Unstructured data Access Management
  • 17. Search Platform Objectives Constraints : being able to reach WebSite and content : Internal WebSites (Intranet) & External WebSites Internal Document Repositories 17II-SDV, Nice Being able to index WebSite content (and page updates) Beeing able to store unstructured data Crawling Storing Indexing
  • 18. Search Platform Objectives 18II-SDV, Nice Provide usable Search results (auto classification, visualization) Don’t Forget why and what you search : • You search in existing documents • You need visualization tools • Its not a crystal ball : search reflects the past Provide usable Search interfaces (semantic search, multi language search …) Search Interface Result Visualization
  • 19. 19II-SDV, Nice Lucene is a java based indexing and search API Solr/Lucene is the leading server extension of Lucene. 2 companies, LucidWorks (Fusion) and ElasticSearch, provides packaging and extension of top of Lucene and Solr. -Nutch is the crawling component -Tika is a document Metadata manager – content analysis toolkit -Zookeeper is a multi thread process manager OpenSource LandScape
  • 20. 20II-SDV, Nice -Search Landscape -Lucene : http://lucene.apache.org -Solr/Lucene : http://lucene.apache.org/solr/ -Plateform OpenSearch : http://www.open-search-server.com -Plateform Katta : http://katta.sourceforge.net -Plateform LucidWorks : http://www.lucidworks.com -Plateform ElasticSearch : http://www.elasticsearch.com -Sphinx : http://sphinxsearch.com/ -Cloudera : https://www.cloudera.com/documentation/enterprise/5-5- x/topics/search_architecture.html -FranceLabs : http://www.francelabs.com/ (Datafari) -AklaBox : www.aklabox.com (AklaSearch) OpenSource LandScape
  • 21. 21II-SDV, Nice Lucene : Retrieval Software library Use existing Search Infrastructure like Solr/Lucene (Vanilla certified) http://www.lucidworks.com/ or http://www.elasticsearch.org/ Search Engine Focus
  • 22. 22II-SDV, Nice -Cloudera with Solr/Cloud (Solr/Lucene) -Mapr with ElasticSearch (Lucene code) -HortonWorks with LucidWorks (Solr/Lucene) Hadoop Search Platform - Big Data
  • 23. 23II-SDV, Nice Before indexing your document base, you need to access it ! Apache Nutch is a highly extensible and scalable open source web crawler software project. Reference : http://nutch.apache.org/ Nutch
  • 24. 24II-SDV, Nice Solr • What is Solr – Indexation and Search Engine • Promoted by the Apache Foundation • Built on Top of Apache Lucene (Java Search library) – Major engine characteristics • Scalable, fault tolerance, distribution indexation process, dynamic workload balancer, centraized configuration – Technical environment • Java • Embeded Jetty server for platform administration
  • 25. 25II-SDV, Nice Solr Main characteristics Admin Interface Flexible and scalable Configuration Modular Multiple index management with a signle instance
  • 26. 26II-SDV, Nice Solr Main characteristics Standard communication interfaces (html, xml, json) Configuration can be done with or without schema Real time Indexation
  • 27. 27II-SDV, Nice Solr Main characteristics Customizable Full Text analysis Rich documents indexation (using Tika)
  • 28. 28II-SDV, Nice Solr Main characteristics Search by facet and filters Term suggestion and orthograph correction Geospatial Search
  • 30. 30II-SDV, Nice -Synonyms - It is possible to extend the search to synonyms if they are listed in a glossary. For example, to find articles containing synonyms to “TV” when you search with the word TV. -Metadata - Dictionary for list of searchable keywords Search Engine Basic (1/2)
  • 31. 31II-SDV, Nice -Reserved Words, Protected Words - Indexing usually uses stemming, which is to reduce words to their root, for example "Developp" to find items also contain the word when trying to develop the word development. However, sometimes there are adverse lemmatizations, indexing under one lemma two words that have no relation. It is possible to prevent the stemming of words by listing them in a file protwords.txt. -StopWords - The stopwords are meaningless words. A word considered insignificant will be ignored. Note that some words are insignificant in some contexts, others have homonyms signifiers. For example, can refer to a summer season (rather mean) or past participle of the verb to be (relatively insignificant). Stopwords.txt the file looks like this Search Engine Basic (2/2)
  • 32. 32II-SDV, Nice -Multi Language support (this is where commercial search engine have still more to bring to customer), even there is now Asian type language support (Hindi, Thai, Chineese, …) -Elision : - Elisions are a feature of the French, which consist of a contraction of the words like or when they are followed by a vowel. Example: + aircraft gives the aircraft. It is possible to remove these elisions using a lexicon. -Limits solved other the past 3 years • Full text search interface (language with search engine) • SubQuery support : now its ok starting with Solr 4.7 (we are v6) • Scalability (this is where Solr is taking technical advantage) Search Engine Current Limits
  • 33. 33II-SDV, Nice -Advance indexing and querying tools. -Provides distributed searching capabilities to prevent bottleneck for a particular server. -Provides document excerpts (snippets) generation that provides summary of the search -Relevance ranking display extracts from the documents based on the query. Search Interface expectation (1/3)
  • 34. 34II-SDV, Nice -Duplicate document detection, including fuzzy near duplicates -Rich Document Parsing and Indexing without using Database Indexing. -Ranking control carry out a targeted ranking of individual documents. -Search Grouping by Type / Tag / Categories (General page, documents, images) Search Interface expectation (2/3)
  • 35. 35II-SDV, Nice -Multi Criteria support -Ranking -Natural language support -Apps Support (Android, Ipad) Search Interface expectation (3/3)
  • 36. Project at Ministry Initial decision and guidelines from Ministry 36II-SDV, Nice New WebSite will be done using Drupal CMS 8.2 WebSite should be powered by a « Google alike Search Toolbar » WebSite – Infrastructure – should connect with multiples other WebSite All Infra (Software) must be Open Source components
  • 37. Project at Ministry 37II-SDV, Nice http://www.developpement-durable.gouv.fr/
  • 38. Project at Ministry 38II-SDV, Nice http://www.developpement-durable.gouv.fr/
  • 39. Project at Ministry - Architecture 39II-SDV, Nice
  • 40. Project at Ministry - Architecture 40II-SDV, Nice
  • 41. Project at Ministry - Technical 41II-SDV, Nice Projects Steps Nutch crawler for various WebSite • Facebook, LinkedIn, Twitter, Youtube … • Internal WebSite, Previous WebSite Drupal Forms for Metadata & indexation • Specific Forms for different kind of documents • Drupal CMS process to add new content Drupal 8 Module for Solr : custom search, monitoring, reporting • Existing drupal solr is limited to single instance of drupal • Not possible to use Solr Admin interface
  • 42. Project at Ministry - Technical 42II-SDV, Nice Additional PHP libraries Curl : Communication Drupal-Solr (http-get http-post & attached file) Ssh2 : server administration command Zookeeper : Communication Drupal-Zookeeper MemCached : Communication Drupal-Memcached Solarium : Communication Drupal-Solr (abstraction layer) GoogleApi : youtube content indexation
  • 43. Project at Ministry – Admin Interface 43II-SDV, Nice Drupal8 Addon to setup the global infrastructure (Zookeeper, Solr)
  • 44. Project at Ministry – Admin Interface 44II-SDV, Nice Drupal8 Addon to monitor the global infrastructure - Statistics
  • 45. Project at Ministry - Validation 45II-SDV, Nice Projects Validation & Deployment No problems with Zookeeper, Solr, Nutch Stress tests for the global platform : initial slow down with 10 000 simultaneous connection Sub-Project : Adressing the Single Point of Failure Solution : Problems with Drupal & MySql -> MemCached
  • 46. Project at Ministry - Next 46II-SDV, Nice Next Steps Review of WebSite content … new Ministry New Content to be indexed : • Other WebSite and Social Content • New set of document to be added in the repository