OCTOBER 13-16, 2016 • AUSTIN, TX
Properly integrate ManifoldCF with Solr
Aurélien MAZOYER
Search Expert, Co-founder, France Labs
3
01
Apache Manifold CF
o Agenda
• Overview of ManifoldCF
• Our scenario : find files on a file share
• In real life
4
01
Apache Manifold CF
o Overview
• Connector Framework
• Incremental crawling
• Handle authorization
• Configuration via REST API and UI
5
01
Apache Manifold CF
o History
• Based on the "Connector Framework" developed by Karl Wright for the MetaCarta Appliance
• Donated to the Apache Software Foundation in 2009
• May 2012 : out of incubation
• Current version : 2.2 (August 2015)
6
01
Connectors gone wild
o Different connectors for :
• Content repositories
• Web, Wiki, DB, Email, RSS, CMIS, Alfresco…
• But also Windows Share, Sharepoint, Dropbox…
• Authorities
• LDAP, AD, CMIS…
• Output
• Solr, Elasticsearch, OSS…
7
03
Big picture
[Diagram: the ManifoldCF daemon agent runs the repository connectors (Conn. 1 … Conn. N, reaching Wiki, DB, … Repository N) and the output connectors (Conn. 1 … Conn. N, reaching Solr, Elasticsearch, …). The ManifoldCF authority service runs the authority connectors (Conn. 1 … Conn. N, reaching OpenLDAP, … Authority N). The ManifoldCF UI and the ManifoldCF API are shown alongside.]
8
01
Roles of components
o Daemon agent
• Java process
• Run repository and output connectors
• Run data crawling jobs
9
01
Roles of components
o Authority service
• Web application
• Run authority connectors
• Get security tokens for a specific user
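A minimal Python sketch of how a client might read the tokens returned by the authority service. It assumes the service is queried at a `UserACLs` endpoint with a `username` parameter and answers with one `PREFIX:value` line per entry (for example `TOKEN:` and `AUTHORIZED:`). The host, port, and sample response below are assumptions for illustration; check them against your ManifoldCF version.

```python
from urllib.parse import urlencode

# Assumed base URL of the authority service web application.
AUTHORITY_BASE = "http://localhost:8345/mcf-authority-service"


def user_acls_url(username, base=AUTHORITY_BASE):
    """Build the URL used to ask for a user's security tokens."""
    return f"{base}/UserACLs?{urlencode({'username': username})}"


def parse_user_acls(body):
    """Split the line-oriented response into tokens and status lines."""
    tokens, statuses = [], []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        prefix, _, value = line.partition(":")
        if prefix == "TOKEN":
            tokens.append(value)
        else:
            # Anything that is not a token (e.g. AUTHORIZED) is kept as a status.
            statuses.append((prefix, value))
    return tokens, statuses


# Hypothetical response body for one user.
SAMPLE = "AUTHORIZED:AD\nTOKEN:AD:S-1-5-32-544\nTOKEN:AD:S-1-5-18\n"
```

The parsing is kept separate from the HTTP call so it can be exercised without a running authority service.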
10
01
Component
Output Connection   Repo Connection   Crawl Job
1…1 1…* 1…* 1…*
o ManifoldCF UI
That’s it.
11
01
API Configuration
o API
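As an illustration of configuring ManifoldCF through its REST API, here is a hedged Python sketch that only builds the HTTP requests. The base URL, the `jobs` and `start/<job_id>` resource names, and the job id are assumptions made for this example; verify them in the API documentation for your release.

```python
import json
from urllib.request import Request

# Assumed base URL of the JSON API web application.
API_BASE = "http://localhost:8345/mcf-api-service/json"


def api_request(method, resource, payload=None, base=API_BASE):
    """Build (but do not send) a request against the JSON API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = Request(f"{base}/{resource}", data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# List the configured jobs, then start one of them.
list_jobs = api_request("GET", "jobs")
start_job = api_request("PUT", "start/1234567890")  # hypothetical job id
```

Each request can then be sent with `urllib.request.urlopen(req)` against a running instance.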
12
01
Test it!
o For testing purposes:
• java -jar start.jar
• All-in-one process
• Embedded database (HSQL)
13
01
Taking MCF to production
Multi-process deployment
o 3 web applications in a servlet container
• mcf-crawler-ui
• mcf-authority-service
• mcf-api-service
o Daemon agent
o Database
• PostgreSQL
o Synchronization through the filesystem (local) or distributed (ZooKeeper)
14
01
Search files with Security : Solr + MCF
o Our scenario
• File share using Active Directory
• Search with Solr
• With security constraints
15
01
Security model : Solr + MCF
o Authorization
• Early Binding
• Index documents with ACLs
• Compute authorization at runtime
o Authentication
• Not handled by Solr/ManifoldCF
• Front-end application should authenticate user
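The authorization model above can be sketched in a few lines of Python. This is a simplified, single-level illustration: the `allow` and `deny` field names are hypothetical, and a real deployment may store several levels of tokens per document.

```python
def is_visible(doc, user_tokens):
    """Return True when the user may see the document.

    A matching deny token always wins; otherwise at least one
    allow token stored with the document must match.
    """
    tokens = set(user_tokens)
    if tokens & set(doc.get("deny", ())):
        return False
    return bool(tokens & set(doc.get("allow", ())))


def authorized_results(docs, user_tokens):
    """Filter a result list down to the documents the user may see."""
    return [d for d in docs if is_visible(d, user_tokens)]
```

The ACLs are written once at indexing time (early binding), while the user's tokens are looked up and compared at query time.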
16
01
Search files with security : Solr + MCF
[Diagram, Phase 1 : Indexing. In the ManifoldCF daemon agent, the JCIFS repository connector crawls documents with their ACLs from the Windows Share, and the Solr output connector sends the documents and ACLs to the Solr Extracting Handler. The ManifoldCF authority service (AD connector, talking to AD) and the MCF Plugin in Solr are also shown.]
17
01
Search files with security : Solr + MCF
[Diagram, Phase 2 : Searching. The front end sends an authenticated search to Solr. The MCF Plugin in Solr gets the user's access tokens from the ManifoldCF authority service (AD connector, talking to AD), filters documents based on the ACLs and the user's info, and returns the authorized results. The daemon agent (JCIFS connector, Solr connector), the Extracting Handler, and the Windows Share are also shown.]
18
01
Configure Solr + MCF
o ManifoldCF side
o 4 connections and 1 job
• Create Windows Share connection
• Create Solr connection
• Create Active Directory connection
• Create Authority Group connection
• Create a crawling Job
19
01
Component
0…1
1…*
Authority Group
Authority Connection
1…1
1…*
Output Connection   Repo Connection   Crawl Job
1…1 1…* 1…* 1…*
20
01
Component
AD Group
Crawl Job Solr Connection
AD Connection
Windows Share
Connection
21
01
Configure Solr + MCF
o Front-end side
o Authentication
• For Tomcat
• Tomcat JNDI Realm
• Tomcat SPNEGO
22
01
Configure Solr + MCF
o Solr side
o Modify schema.xml
• Add fields for security tokens
o Modify solrconfig.xml
• Add MCF Solr Plugin (query parser)
o And don’t forget to protect the Solr instance :-P
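Once the plugin is registered as a query parser, the front end has to pass the authenticated user along with each search. The Python sketch below builds such a request URL. The parser name `manifoldCFSecurity`, the `AuthenticatedUserName` parameter, and the Solr core URL are assumptions for illustration; use the names from your own solrconfig.xml and plugin version.

```python
from urllib.parse import urlencode


def secured_select_url(solr_core_url, query, username,
                       parser_name="manifoldCFSecurity"):
    """Build a /select URL that applies the security filter for one user."""
    params = [
        ("q", query),
        # Filter query that delegates to the (assumed) security query parser.
        ("fq", "{!" + parser_name + "}"),
        # User name established by the front end's authentication step.
        ("AuthenticatedUserName", username),
    ]
    return f"{solr_core_url}/select?{urlencode(params)}"
```

Because the user name travels as a plain request parameter, anyone who can reach Solr directly could impersonate another user, which is why the Solr instance itself has to be protected.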
23
01
Configure Solr + MCF
o Leverage Solr Extracting handler
• Based on Apache Tika
• MIME type detection
• Embeds parsing libraries
• Supported formats:
• MS Office (OLE2 and OOXML)
• OpenDocument
• PDF
• Audio/video/image files
• OCR now supported thanks to Tika 1.7 (and Tesseract)
o Extraction can now be done directly in MCF!
24
01
Component
0…1
1…*
Authority Group
Authority Connection
1…1
1…*
Output Connection   Repo Connection   Crawl Job
1…1 1…* 1…* 1…*
Transformation
Connection
0…*
1…*
25
01
Crawling principle
o Crawling model
• Incremental model
• Continuous model
ManifoldCF In Action – Chapter 1 (Karl Wright)
Phase 1 Phase 2
26
01
Incremental crawling of file share
o Incremental crawling not so easy with some
repositories:
Windows Share Connector
JCIFS
Windows Share
Uhuuu, file share, what's new
since last time we met?
Errkkk…
27
01
Incremental crawling of file share : Solr + MCF
o Phase 1 : Discovery/Indexing (depth first)
For each start path entry on the Windows Share
    Fetch SMB file attributes
    If file is a directory and matches inclusion regex
        List files in SMB directory
        For each file: repeat from "Fetch SMB file attributes"
    If file is a regular file and matches inclusion regex
        Check ingeststatus entry in crawler DB
        If no entry or the version attribute is different
            Fetch file content
            Update ingeststatus entry in DB
            Push file to Solr
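The discovery/indexing pass on this slide can be sketched in Python as follows. This is an illustration, not ManifoldCF's implementation: the callback names (`list_dir`, `get_attrs`, `fetch`, `push`) and the attribute dictionary are hypothetical, and the sketch records the new version only after the push, which is a choice made for this example.

```python
import re


def crawl(start_paths, list_dir, get_attrs, fetch, push, ingeststatus,
          include=re.compile(r".*")):
    """Depth-first discovery/indexing pass; returns the set of files seen."""
    seen = set()
    stack = list(start_paths)
    while stack:
        path = stack.pop()
        attrs = get_attrs(path)  # e.g. {"is_dir": bool, "version": str}
        if not include.match(path):
            continue
        if attrs["is_dir"]:
            # Directory: queue its entries so they are visited next.
            stack.extend(list_dir(path))
            continue
        seen.add(path)
        # Re-index only when there is no entry or the version changed.
        if ingeststatus.get(path) != attrs["version"]:
            push(path, fetch(path))
            ingeststatus[path] = attrs["version"]
    return seen
```

Every file still has to be visited to read its attributes, but the content is fetched and pushed only when the stored version differs.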
28
01
Incremental crawling of file share
o What is an ingeststatus database entry?
o Simplified version :
DOCURI                            LAST_INGEST            LAST_VERSION
protocol://REPO_HOST/Doc1.docx    10.09.2015 18:21:04    Doc1_Version1
protocol://REPO_HOST/Doc2.docx    10.09.2015 19:21:04    Doc2_Version1
o LastVersion?
• Here, computed from lastModified and ACLs on the file
• Example :
+S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/TestFile.txt+1444462827664:16Y
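To show why a version computed from lastModified and the ACLs detects both content and permission changes, here is a small Python sketch. The string layout is hypothetical and deliberately simpler than the real version string produced by the connector.

```python
def version_string(last_modified_ms, allow_sids, deny_sids=()):
    """Combine ACLs and modification time into one comparable string."""
    # Sorting makes the result independent of the order the ACLs are read in.
    parts = ["+" + sid for sid in sorted(allow_sids)]
    parts += ["-" + sid for sid in sorted(deny_sids)]
    return "".join(parts) + "|" + str(last_modified_ms)
```

If either the timestamp or any SID changes, the string changes, and the crawler treats the document as needing re-indexing.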
29
01
Incremental crawling of file share : Solr + MCF
o Phase 2 : Deleting unreachable documents
For each crawler DB entry that was not reached in phase 1
    Send delete command to Solr
    Update crawler database
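A hedged Python sketch of this cleanup phase, assuming the crawler database is a dictionary keyed by document URI and that the set of documents reached during phase 1 is available. Both structures and the `send_delete` callback are hypothetical.

```python
def delete_unreachable(ingeststatus, seen, send_delete):
    """Delete from the index every tracked document that was not reached."""
    removed = [uri for uri in ingeststatus if uri not in seen]
    for uri in removed:
        send_delete(uri)        # e.g. issue a delete-by-id to Solr
        del ingeststatus[uri]   # then forget the document in the crawler DB
    return removed
```

The list of URIs is computed before the loop so the dictionary is not modified while it is being iterated.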
30
01
How to see what happened
o Search History
o Monitoring
• Job Status
• Notification Connections
31
01
How to see what happened
o Search History
o History
• Simple History
• Maximum Activity
• Maximum Bandwidth
• Result Histogram
o Status
• Document Status
• Queue Status
32
01
Performance issue
o Find bottleneck
• Crawled repository
• Network
• Solr
• MCF database
• MCF configuration
33
01
Handle performance issue
o Specific connector’s configuration
• Throttling
• Max JVM connections
o Can improve speed / limit impact on crawled repository
o Very specific to the repository
34
01
Handle performance issue
o Job settings
o Size limit of ingested documents
o Use regex to remove some extensions from crawl
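The two job-level filters above (size limit and extension exclusion) amount to a simple predicate. The Python sketch below uses a hypothetical extension list and size threshold; in ManifoldCF these rules are set in the job and connection configuration rather than in code.

```python
import re

# Hypothetical exclusion rule: skip disk images, archives and executables.
EXCLUDED = re.compile(r".*\.(iso|zip|exe)$", re.IGNORECASE)
# Hypothetical size limit: 50 MB.
MAX_BYTES = 50 * 1024 * 1024


def should_ingest(path, size_bytes, excluded=EXCLUDED, max_bytes=MAX_BYTES):
    """Return True when a file passes both the extension and size filters."""
    return not excluded.match(path) and size_bytes <= max_bytes
```

Filtering before the content is fetched saves bandwidth on the repository side as well as indexing time in Solr.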
35
01
Investigate errors
• Increase connector’s log level
• Read MCF simple history
• Thread Dump
36
01
Common errors in file crawling
o Crawler account rights
o Exotic files
o Very biiiiiiig files
o JCIFS errors
o Solr connector timeout
37
01
When to use ManifoldCF?
q = crawled_environment:heterogeneous
OR scenario:intranet
OR security:mandatory
38
01
References
o ManifoldCF documentation
https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html
o ManifoldCF in Action (K. Wright)
https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
o Securing Solr document with MCF (K. Wright)
http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
o France Labs blog posts :
http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/
39
01
Datafari
Search
Admin
o Intranet “ready to play” search solution
• Apache License
o Embeds:
o Solr
o ManifoldCF
o And other cool stuff:
• Admin and responsive search UI
• User Management
• Banana for user behavior analysis
• Tesseract OCR
• A funny zebra
• Etc…
www.datafari.com
40
aurelien.mazoyer@francelabs.com
@francelabs
www.francelabs.com

Mais conteúdo relacionado

Mais procurados

Data Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch FixData Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch Fix
Stefan Krawczyk
 

Mais procurados (20)

MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
 
Pub/Sub Messaging
Pub/Sub MessagingPub/Sub Messaging
Pub/Sub Messaging
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Multi cluster, multitenant and hierarchical kafka messaging service slideshare
Multi cluster, multitenant and hierarchical kafka messaging service   slideshareMulti cluster, multitenant and hierarchical kafka messaging service   slideshare
Multi cluster, multitenant and hierarchical kafka messaging service slideshare
 
How to Send IDOC to SAP using MuleSoft
How to Send IDOC to SAP using MuleSoftHow to Send IDOC to SAP using MuleSoft
How to Send IDOC to SAP using MuleSoft
 
APACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka StreamsAPACHE KAFKA / Kafka Connect / Kafka Streams
APACHE KAFKA / Kafka Connect / Kafka Streams
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Spark access control on Amazon EMR with AWS Lake Formation
Spark access control on Amazon EMR with AWS Lake FormationSpark access control on Amazon EMR with AWS Lake Formation
Spark access control on Amazon EMR with AWS Lake Formation
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 
Walking through the Spring Stack for Apache Kafka with Soby Chacko | Kafka S...
 Walking through the Spring Stack for Apache Kafka with Soby Chacko | Kafka S... Walking through the Spring Stack for Apache Kafka with Soby Chacko | Kafka S...
Walking through the Spring Stack for Apache Kafka with Soby Chacko | Kafka S...
 
Data Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch FixData Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch Fix
 
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration Options
 
Event driven architecture with Kafka
Event driven architecture with KafkaEvent driven architecture with Kafka
Event driven architecture with Kafka
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to Kafka
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Data-Streaming at DKV
Data-Streaming at DKVData-Streaming at DKV
Data-Streaming at DKV
 

Destaque

Coveo_Intelligent_Workplace_eBook
Coveo_Intelligent_Workplace_eBookCoveo_Intelligent_Workplace_eBook
Coveo_Intelligent_Workplace_eBook
Stephen Alfano
 

Destaque (20)

A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 
Presentation Lucene / Solr / Datafari - Nantes JUG
Presentation Lucene / Solr / Datafari - Nantes JUGPresentation Lucene / Solr / Datafari - Nantes JUG
Presentation Lucene / Solr / Datafari - Nantes JUG
 
Besoin de rien Envie de Search - Presentation Lucene Solr ElasticSearch
Besoin de rien Envie de Search - Presentation Lucene Solr ElasticSearchBesoin de rien Envie de Search - Presentation Lucene Solr ElasticSearch
Besoin de rien Envie de Search - Presentation Lucene Solr ElasticSearch
 
Apprendre Solr en deux heures
Apprendre Solr en deux heuresApprendre Solr en deux heures
Apprendre Solr en deux heures
 
Using Enterprise Search at the city of Antibes
Using Enterprise Search at the city of AntibesUsing Enterprise Search at the city of Antibes
Using Enterprise Search at the city of Antibes
 
Sitecore Dev User Group Meetup in Milwaukee - Perficient - Rick Bauer
Sitecore Dev User Group Meetup in Milwaukee - Perficient - Rick BauerSitecore Dev User Group Meetup in Milwaukee - Perficient - Rick Bauer
Sitecore Dev User Group Meetup in Milwaukee - Perficient - Rick Bauer
 
Plannning for the GSA Sunsetting feat. Coveo
Plannning for the GSA Sunsetting feat. CoveoPlannning for the GSA Sunsetting feat. Coveo
Plannning for the GSA Sunsetting feat. Coveo
 
Apache Solr for eCommerce at Allopneus with France Labs - Lib'Day 2014
Apache Solr for eCommerce at Allopneus with France Labs - Lib'Day 2014Apache Solr for eCommerce at Allopneus with France Labs - Lib'Day 2014
Apache Solr for eCommerce at Allopneus with France Labs - Lib'Day 2014
 
Concepts de Recherche dans un environnement WSS et MOSS
Concepts de Recherche dans un environnement WSS et MOSSConcepts de Recherche dans un environnement WSS et MOSS
Concepts de Recherche dans un environnement WSS et MOSS
 
SharePoint Search for Dummies
SharePoint Search for DummiesSharePoint Search for Dummies
SharePoint Search for Dummies
 
Coveo Search - Product Overview
Coveo Search - Product OverviewCoveo Search - Product Overview
Coveo Search - Product Overview
 
Coveo_Intelligent_Workplace_eBook
Coveo_Intelligent_Workplace_eBookCoveo_Intelligent_Workplace_eBook
Coveo_Intelligent_Workplace_eBook
 
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
 
Apache ManifoldCF
Apache ManifoldCFApache ManifoldCF
Apache ManifoldCF
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)Introduction to Big Data processing (FGRE2016)
Introduction to Big Data processing (FGRE2016)
 
Netflix Global Search - Lucene Revolution
Netflix Global Search - Lucene RevolutionNetflix Global Search - Lucene Revolution
Netflix Global Search - Lucene Revolution
 
Intro to Apache Solr
Intro to Apache SolrIntro to Apache Solr
Intro to Apache Solr
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
 
Language support and linguistics in lucene solr & its eco system
Language support and linguistics in lucene solr & its eco systemLanguage support and linguistics in lucene solr & its eco system
Language support and linguistics in lucene solr & its eco system
 

Semelhante a Integrate ManifoldCF with Solr

Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Lucidworks
 
Automated Cluster Management and Recovery for Large Scale Multi-Tenant Sea...
  Automated Cluster Management and Recovery  for Large Scale Multi-Tenant Sea...  Automated Cluster Management and Recovery  for Large Scale Multi-Tenant Sea...
Automated Cluster Management and Recovery for Large Scale Multi-Tenant Sea...
Lucidworks
 
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBMUnderstanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Lucidworks
 

Semelhante a Integrate ManifoldCF with Solr (20)

OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
 
II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution fr...
II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution fr...II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution fr...
II-SDV 2017: Datafari - Building an Open Source Enterprise Search Solution fr...
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Automated Cluster Management and Recovery for Large Scale Multi-Tenant Sea...
  Automated Cluster Management and Recovery  for Large Scale Multi-Tenant Sea...  Automated Cluster Management and Recovery  for Large Scale Multi-Tenant Sea...
Automated Cluster Management and Recovery for Large Scale Multi-Tenant Sea...
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
 
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBMUnderstanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
Understanding the Solr Security Framekwork: Presented by Anshum Gupta, IBM
 
Understanding the Solr security framework - Lucene Solr Revolution 2015
Understanding the Solr security framework - Lucene Solr Revolution 2015Understanding the Solr security framework - Lucene Solr Revolution 2015
Understanding the Solr security framework - Lucene Solr Revolution 2015
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Apache Solr for TYPO3 Components & Review 2016
Apache Solr for TYPO3 Components & Review 2016Apache Solr for TYPO3 Components & Review 2016
Apache Solr for TYPO3 Components & Review 2016
 
Kubernetes2
Kubernetes2Kubernetes2
Kubernetes2
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceCOMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
 
Hadoop-scale Search with Solr
Hadoop-scale Search with SolrHadoop-scale Search with Solr
Hadoop-scale Search with Solr
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0
 
ThroughTheLookingGlass_EffectiveObservability.pptx
ThroughTheLookingGlass_EffectiveObservability.pptxThroughTheLookingGlass_EffectiveObservability.pptx
ThroughTheLookingGlass_EffectiveObservability.pptx
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scale
 
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
 

Mais de francelabs

Mais de francelabs (6)

Migration d'Exalead vers Solr - IFCE et France Labs - Search Day 2014
Migration d'Exalead vers Solr - IFCE et France Labs - Search Day 2014Migration d'Exalead vers Solr - IFCE et France Labs - Search Day 2014
Migration d'Exalead vers Solr - IFCE et France Labs - Search Day 2014
 
Apache Solr pour le eCommerce chez Allopneus avec France Labs - Lib'day2014
Apache Solr pour le eCommerce chez Allopneus avec France Labs - Lib'day2014Apache Solr pour le eCommerce chez Allopneus avec France Labs - Lib'day2014
Apache Solr pour le eCommerce chez Allopneus avec France Labs - Lib'day2014
 
Geneva jug Lucene Solr
Geneva jug Lucene Solr Geneva jug Lucene Solr
Geneva jug Lucene Solr
 
Solr + Hadoop - Fouillez facilement dans votre système Big Data
Solr + Hadoop - Fouillez facilement dans votre système Big DataSolr + Hadoop - Fouillez facilement dans votre système Big Data
Solr + Hadoop - Fouillez facilement dans votre système Big Data
 
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
 
Marseille JUG Novembre 2013 Lucene Solr France Labs
Marseille JUG Novembre 2013 Lucene Solr France LabsMarseille JUG Novembre 2013 Lucene Solr France Labs
Marseille JUG Novembre 2013 Lucene Solr France Labs
 

Último

AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 

Último (20)

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 

Integrate ManifoldCF with Solr

  • 1. O C T O B E R 1 3 - 1 6 , 2 0 1 6 • A U S T I N , T X
  • 2. Properly integrate ManifoldCF with Solr Aurélien MAZOYER Search Expert, Co-founder, France Labs
  • 3. 3 01 Apache Manifold CF o Agenda • Overview of ManifoldCF • Our scenario : find files on a file share • In real life
  • 4. 4 01 Apache Manifold CF o Overview • Connector Framework • Incremental crawling • Handle authorization • Configuration via REST API and UI
  • 5. 5 01 Apache Manifold CF o History • Based on « Connector Framework » developed by Karl Wright for the MetaCarta Appliance • Donated to the Apache Software Foundation in 2009 • May 2012 : out of incubation • Current version : 2.2 (August 2015)
  • 6. 6 01 Connectors gone wild o Different connectors for : • Content repositories • Web, Wiki, DB, Email, RSS, CMIS, Alfresco… • But also Windows Share, Sharepoint, Dropbox… • Authorities • LDAP, AD, CMIS… • Output • Solr, Elasticsearch, OSS…
  • 7. 7 03 Big picture Manifold CF Solr Elasticsearch Repository N OpenLDAP Authority N … Daemon Agent Conn. 1 Manifold CF authority service Ouputs Authorities Conn. 2 Conn. N ManifoldCF UI ManifoldCF API Conn. 1 Conn. 2 Conn. N Wiki DB Repository N … … Repositories Conn. 1 Conn. N
  • 8. 8 01 Roles of components o Daemon agent • Java process • Run repository and ouput connectors • Run data crawling jobs
  • 9. 9 01 Roles of components o Authority service • Web application • Run authority connectors • Get security tokens for a specific user
  • 10. 10 01 Component Ouput ConnectionRepo Connection Crawl Job 1…1 1…* 1…* 1…* o ManifoldCF UI That’s it.
  • 12. 12 01 Test it! o For testing purpose: • java –jar post.jar • All-in-one process • Embedded database (HSQL)
  • 13. 13 01 Taking MCF to production Multi-process deployment o 3 web application in a servlet container • mcf-crawler-ui • mcf-authorization-service • mcf-api-service o Daemon agent o Database • PostgresSQL o Synchronize on filesystem ( local or distributed (zK) )
  • 14. 14 01 Search files with Security : Solr + MCF o Our scenario • File share using Active Directory • Search with Solr • With security constraints
  • 15. 15 01 Security model : Solr + MCF o Authorization • Early Binding • Index documents with ACLs • Compute authorization at runtime o Authentication • Not handled by Solr/ManifoldCF • Front-end application should authenticate user
  • 16. 16 01 Search files with security : Solr + MCF o Phase 1 : Indexing (diagram) • Daemon agent : the JCIFS connector crawls documents with ACLs from the Windows Share • The Solr output connector sends docs and ACLs to the Solr Extracting Handler • Also shown : Manifold CF authority service with the AD connector (AD), and the MCF plugin in Solr
  • 17. 17 01 Search files with security : Solr + MCF o Phase 2 : Searching (diagram) • The front end sends an authenticated search to Solr • The Solr MCF plugin gets the user access token from the Manifold CF authority service (AD connector, AD) • Solr filters docs based on ACLs and user info • Authorized results are returned to the front end
  • 18. 18 01 Configure Solr + MCF o ManifoldCF side o 4 connections and 1 job • Create Windows Share connection • Create Solr connection • Create Active Directory connection • Create Authority Group connection • Create a crawling Job
  • 19. 19 01 Component 0…1 1…* Authority Group Authority Connection 1…1 1…* Output Connection Repo Connection Crawl Job 1…1 1…* 1…* 1…*
  • 20. 20 01 Component AD Group Crawl Job Solr Connection AD Connection Windows Share Connection
  • 21. 21 01 Configure Solr + MCF o Front-end side o Authentication • For Tomcat • JNDI Tomcat Realm • TomcatSPNEGO
  • 22. 22 01 Configure Solr + MCF o Solr side o Modify schema.xml • Add fields for security tokens o Modify solrconfig.xml • Add MCF Solr Plugin (query parser) o And don’t forget to protect the Solr instance :-P
  • 23. 23 01 Configure Solr + MCF o Leverage Solr Extracting handler • Based on Apache Tika • Mime type detection • Embedded parsing libraries • Supported formats: • MS Office (OLE2 and OOXML) • OpenDocument • PDF • Audio/video/image files • Now OCRs thanks to Tika 1.7 (and Tesseract) o Now, can be done directly in MCF!
  • 24. 24 01 Component 0…1 1…* Authority Group Authority Connection 1…1 1…* Output Connection Repo Connection Crawl Job 1…1 1…* 1…* 1…* Transformation Connection 0…* 1…*
  • 25. 25 01 Crawling principle o Crawling model • Incremental model • Continuous model ManifoldCF In Action – Chapter 1 (Karl Wright) Phase 1 Phase 2
  • 26. 26 01 Incremental crawling of file share o Incremental crawling not so easy with some repositories: • Windows Share Connector (JCIFS) : "Uhuuu, file share, what's new since last time we met?" • Windows Share : "Errkkk…"
  • 27. 27 01 Incremental crawling of file share : Solr + MCF o Phase 1 : Discovery/Indexing (depth first, against the Windows Share) • For each start path entry • List files in SMB directory • For each file • Fetch SMB file attributes • If file is a directory and if matches inclusion regex : descend into it • If file is a regular file and if matches inclusion regex • Check ingeststatus entry in crawler DB • If no entry or the version attribute is different • Fetch file content • Push file to Solr • Update ingeststatus entry in DB
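The discovery/indexing flow on this slide can be sketched as below. This is a toy model under stated assumptions, not the JCIFS connector: the file share is a nested dict (directories are dicts, files are `(version, content)` tuples), `ingest_status` is a plain dict standing in for the crawler database, and `push` stands in for the Solr output connector. All names are hypothetical.

```python
import re


def crawl(start_paths, tree, dir_regex, file_regex, ingest_status, push):
    """Depth-first crawl; returns the set of file paths that were reached."""
    seen = set()

    def visit_directory(path, entries):
        for name, node in entries.items():
            child = f"{path}/{name}"
            if isinstance(node, dict):
                # Directory: descend only if it matches the inclusion regex.
                if dir_regex.search(child):
                    visit_directory(child, node)
            elif file_regex.search(child):
                version, content = node
                seen.add(child)
                # Re-index only when there is no entry or the version changed.
                if ingest_status.get(child) != version:
                    push(child, content)
                    ingest_status[child] = version

    for start in start_paths:
        visit_directory(start, tree[start])
    return seen
```

Running the crawl twice on an unchanged tree pushes nothing the second time, which is the point of keeping a version per document.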
  • 28. 28 01 Incremental crawling of file share o What is ingeststatus database entry? o Simplified version (columns : DOCURI | LAST_INGEST | LAST_VERSION) : • protocol://REPO_HOST/Doc1.docx | 10.09.2015 18:21:04 | Doc1_Version1 • protocol://REPO_HOST/Doc2.docx | 10.09.2015 19:21:04 | Doc2_Version1 o LastVersion? • Here, computed from lastModified and ACLs on the file • Example : +S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/TestFile.txt+1444462827664:16Y
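To illustrate why the version is derived from both lastModified and the ACLs, here is a small sketch of a version string builder. The layout is invented for the example and is not ManifoldCF's actual encoding (the real string on the slide is more elaborate); the function name is hypothetical.

```python
def version_string(allow_sids, deny_sids, last_modified_ms):
    """Build a comparable version from the ACLs and the modification time.

    SIDs are sorted so that the same ACL always yields the same string,
    whatever order the repository returned it in.
    """
    allow = "+".join(sorted(allow_sids))
    deny = "+".join(sorted(deny_sids))
    return f"{allow}|{deny}|{last_modified_ms}"
```

A document is re-indexed when either its content changed (new lastModified) or its permissions changed (different SIDs), since both alter the string.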
  • 29. 29 01 Incremental crawling of file share : Solr + MCF o Phase 2 : Deleting unreachable documents • For each crawler DB entry • Send delete command to Solr • Update Crawler database
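The deletion phase can be sketched in the same toy model: every crawler database entry that the discovery phase did not reach is deleted from the index and then from the database. Here `ingest_status` is a dict standing in for the crawler database, `seen` is the set of URIs reached during discovery, and `send_delete` stands in for the Solr delete command. All three are hypothetical names.

```python
def delete_unreachable(ingest_status, seen, send_delete):
    """Remove entries for documents that were not reached in this crawl."""
    removed = []
    # Iterate over a copy of the keys: the dict is modified in the loop.
    for uri in list(ingest_status):
        if uri not in seen:
            send_delete(uri)
            del ingest_status[uri]
            removed.append(uri)
    return removed
```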
  • 30. 30 01 How to see what happened o Search History o Monitoring • Job Status • Notification Connections
  • 31. 31 01 How to see what happened o Search History o History • Simple History • Maximum Activity • Maximum Bandwidth • Result Histogram o Status • Document Status • Queue Status
  • 32. 32 01 Performance issue o Find bottleneck • Crawled repository • Network • Solr • MCF database • MCF configuration
  • 33. 33 01 Handle performance issue o Specific connector’s configuration • Throttling • Max JVM connections o Can improve speed / limit impact on crawled repository o Very specific to the repository
  • 34. 34 01 Handle performance issue o Job settings o Size limit of ingested documents o Use regex to remove some extensions from crawl
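The two job-level filters mentioned here (a size limit and an extension exclusion regex) amount to a simple predicate. The sketch below is only an illustration: the excluded extensions and the 10 MB limit are arbitrary example values, and in ManifoldCF these rules are set in the job and connection configuration rather than in code.

```python
import re

# Example values only: tune both to the file share being crawled.
EXCLUDED_EXTENSIONS = re.compile(r"\.(mp4|avi|mkv|iso)$", re.IGNORECASE)
MAX_SIZE_BYTES = 10 * 1024 * 1024


def should_ingest(path, size_bytes,
                  excluded=EXCLUDED_EXTENSIONS, max_bytes=MAX_SIZE_BYTES):
    """Return True when the file passes both the regex and the size filter."""
    return not excluded.search(path) and size_bytes <= max_bytes
```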
  • 35. 35 01 Investigate errors • Increase connector’s log level • Read MCF simple history • Thread Dump
  • 36. 36 01 Common errors in file crawling o Crawler account rights o Exotic files o Very biiiiiiig files o JCIFS errors o Solr connector timeout
  • 37. 37 01 When to use ManifoldCF? q = crawled_environment:heterogeneous OR scenario:intranet OR security:mandatory
  • 38. 38 01 References o ManifoldCF documentation https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html o ManifoldCF in Action (K. Wright) https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs o Securing Solr document with MCF (K. Wright) http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011 o France Labs blog posts : http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/ http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/
  • 39. 39 01 Datafari Search Admin o Intranet “ready to play” search solution • Apache License o Embed: o Solr o ManifoldCF o And other cool stuff: • Admin and responsive search UI • User Management • Banana for user behavior analysis • Tesseract OCR • A funny zebra • Etc… www.datafari.com

Editor's Notes

  1. Start : 0:00. End : 1:05 Hi, thank you. Me : Aurélien MAZOYER, co-founder of France Labs, an open source company based in France. We offer consulting on the search technologies ecosystem, and Datafari, an intranet search solution. Say a few words about Datafari at the end of the talk. Topic. Survey : how many of you have ever used ManifoldCF?
  2. Start : 1:05. End : 1:40 3 parts: overview of MCF; a case study on the integration of MCF with Solr in order to search files; what happens with MCF in real life
  3. Start 1:40 to 2:45 CF stands for Connector Framework. That means that it is a tool that helps you to connect heterogeneous repositories. Push the data to your favorite search engine. Keep it synchronized. Take access rights into account to perform authenticated search. Provides a complete UI and REST API
  4. Start 2:45 to 3:15 Karl Wright, when he worked for MetaCarta. Apache Top Level Project since 2012. Active project : last release this summer.
  5. Start 3:15 to 4:25 Plenty of connectors included in ManifoldCF What is called You can write your own (ManifoldCF In Action gives you the best practices to write your own connector) Domain controller, such as an Active Directory Search engine
  6. Start 4:25 to 5:03 Contains different components. You can see the different connectors for the interaction with the external world. Administration interface. Talk about it in a few slides. Cannot see here an underlying database, backbone of the solution
  7. Start 5:03 to 5:30 Actually do the crawling job
  8. Start 5:30 to 6:26 You add the username as a parameter and it provides the security tokens for a specific user. It gives, for example, the SID of the user in Active Directory, and the SIDs of all groups that he belongs to
  9. Start 6:26 to 7:14 Also web application. Administrate MCF. To begin, you will have to create Crawl Job. Start the job. Once you are done, you are now able to start your crawl
  10. Start 7:14 to 8:00 What can be done in the admin Put new config Send command Respect REST standards
  11. Start 8:00 to 8:49 Very simple to test it. Extract the binary distribution, open example directory. Not unfamiliar to Solr users. Not the recommended way to run it in production (mainly because of the HSQL database)
  12. Start 8:49 to 10:00 Components we described, in different processes. The database is very important. One of the recommended databases. Synchronize via local folder on the machine or with ZooKeeper.
  13. Start 10:00 to 11:07 Here is our scenario. Let’s imagine an intranet. Users who authenticate against AD. They put their files on shared folders. You have access rights on folders based on the user. But specific permissions for some users. Of course it is a mess so they need a good search engine to find their documents. Quite simple, not very unusual, but it can be a nightmare if you don’t have the right tool. We are here in a fully proprietary environment. But we will see that MCF and Solr can deal with it.
  14. Start 11:07 to 12:00 A few words. Authorization, when user runs a Solr query. Neither Solr nor MCF will do this job. Up to the front end application
  15. Start 12:00 to 12:34 Go back to the big picture. Step 1 : JCIFS connector fetches documents with their access control and pushes to Solr Extracting handler
  16. Start 12:34 to 13:00 Step 2 : Frontend sends an authenticated query. Retrieves the security tokens linked to the current user. Then, runs a normal search and filters the result set with the help of the document access control list and user security tokens
  17. Start 13:00 to 13:44 How can we actually implement that. On the MCF side. Windows share connection. A few steps to do (download last version of JCIFS library, uncomment the windows share line in the connectors config file)
  18. Start 13:44 to 14:03 Authority connector should belong to an authority group
  19. Start 14:03 to 14:18 That’s it for manifold
  20. Start 14:18 to 15:00 Told you : front end is in charge of the authentication. LDAP protocol to authenticate. TomcatSPNEGO (Active Directory). SPNEGO : use single sign-on
  21. Start 15:00 to 15:57 Add fields that will contain the access control list of the document. Declare the MCF plugin. Configure the endpoint of the authority service. Add a filter query that uses this plugin in your search handler. This is for the search handler
  22. Start 15:57 to 16:30 For the update handler. It is a default extracting handler that integrates Apache Tika. As a reminder, since Solr 5, extracting handler can run Tesseract to extract content from images. Solr can do this job.
  23. Start 16:30 to 17:22 In new version of Manifold. It can also be done In fact, processing pipeline. You can do field mapping but also tika extraction. Perfect if you don’t want to send big files over the network
  24. Start 17:22 to 18:11 Now we will try to understand what is going on under the hood during our crawl These two crawling models are available with manifoldCF. To avoid indexing Discover new documents, remove old ones.
  25. Start 18:11 to 18:33 Some repositories work well with incremental crawling. Others don’t. Unfortunately our windows share won’t be able to answer
  26. Start 18:33 to 20:00 Therefore JCIFS connector. How does the windows share connector handle incremental. If it is a file. Next slide is version attribute. Fetch from the windows share
  27. Start 20:00 to 20:50 For each document Last version Depends on the repository
  28. Start 20:50 to 21:04 This was for step 1. more We can repeat these 2 steps in order to keep our data synchronized. We have covered how to configure this and we have described how it works under the hood. Now it is in production mode and you want to be sure of what is going on
  29. Start 21:04 to 22:33 A lot of information. UI or API. Send alert if something went wrong or just if the crawl is finished
  30. Start 22:33 to 23:00 You also have a tab that shows you a history of all the different activities. Document status, for example if you want to see if a document has already been ingested in the current crawl. Maximum bandwidth will give you information on crawling performance
  31. Start 23:00 to 23:16 Unfortunately sometimes facing obvious Crawled repository that is overloaded. It can be because of the network. You should capture packets with Wireshark. Solr server : for example if the autocommit frequency is too high. MCF database is an important component, be sure that you followed the best practices in the documentation
  32. Start 23:16 to 24:25 Maybe it is because of the configuration of your connector. Two main parameters that can have an impact on performance. Throttling : fixing a hard limit on fetching documents (useful if you are doing web crawling and don’t want to be banned by the webmaster). Max connections that will be made to the system. It can be a good idea if we want to do web crawling to increase this value. But windows share won’t work very well with a lot of connections, so in our scenario we should use a small value
  33. Start 24:25 to 25:15 In the job settings, you can filter the documents that you want to index. For an intranet file share, you probably don’t want to index the last Star Wars movie that an employee wanted to share with their colleagues
  34. Start 25:15 to 26:00 Those were some examples of performance issues. But unfortunately, it can be even worse, you can face errors. If you are facing errors A thread dump can give you information on
  35. Start 26:00 to 28:08 One common problem is when the account you use for crawl doesn’t It must be able to read everything and to read ACLs for each file. It can need special rights, such as Print operator. As we just saw, we can use exclusion regex or size limit. Be also sure to add ignore Tika exception in Solr. JCIFS errors, linked or not to network issues, timeout. Sometimes can be solved by increasing the JCIFS timeout. Sometimes you have to increase the Solr timeout. Big processing
  36. Start 28:08 to 28:45 What can happen in real life To conclude. Massive web crawling : Nutch is the best tool for you Then, go for it.
  37. Start 28:45 to 29:18 Here are some references. That is now freely available. You can have a look at our blog posts, that show you how to run through the different steps that I covered in the file search scenario I described
  38. Start 29:18 to 29:40 If you are too lazy to integrate Solr and ManifoldCF by yourself
  39. Start 29:40 to 30:00 Thank you very much for your attention, Be pleased to answer any question you may have