Indexing with Solr search server and Hadoop framework
Indexing
• Indexing collects, parses, and stores data to facilitate fast and
accurate information retrieval.
• The purpose of storing an index is to optimize speed and performance in
finding documents.
• Without an index, the search engine would have to scan every document
for each query.
• An index costs additional storage and makes updates considerably slower;
these costs are traded off for the time saved during information retrieval.
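The trade-off above can be illustrated with a minimal inverted index. This is a sketch in plain Python, not how Solr stores its index; the function names and sample documents are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Pay the cost up front: map each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Answer a query by intersecting posting sets, without reading any document."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def scan(docs, query):
    """Baseline without an index: tokenize and inspect every document."""
    terms = set(tokenize(query))
    return {d for d, text in docs.items() if terms and terms <= set(tokenize(text))}

# Hypothetical sample documents.
docs = {
    1: "Hadoop stores large data sets",
    2: "Solr provides full-text search",
    3: "Solr can index data stored in Hadoop",
}
index = build_index(docs)
```

Both `search(index, "solr hadoop")` and `scan(docs, "solr hadoop")` find the same documents, but only `scan` has to touch every document at query time.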
Why Hadoop + Solr?
• A data set can outgrow the storage capacity of a single physical machine.
• Distributed filesystems are more complex than regular disk filesystems.
• One of the biggest challenges is making the filesystem tolerate node
failure without suffering data loss.
• Hadoop comes with a distributed filesystem called HDFS.
• HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern.
• Hadoop doesn’t require expensive, highly reliable hardware to run on.
Why Hadoop + Solr? (continued)
• A program written for other frameworks may require large amounts of
refactoring when scaling from ten to one hundred or one thousand
machines.
• This may mean rewriting the program several times.
• Hadoop is specifically designed to have a very flat scalability curve.
• In Hadoop, very little work, if any, is required for that same program to
run on a much larger amount of hardware.
• The Hadoop platform manages the data and hardware resources and
provides dependable performance growth proportionate to the number of
machines available.
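The idea that the same program runs unchanged on more hardware can be sketched with a map/reduce-style job. This is a local simulation in plain Python, not the Hadoop API; `map_doc`, `reduce_term`, and `run_job` are hypothetical names, and each "partition" stands in for the data held by one machine.

```python
from collections import defaultdict
from itertools import chain

def map_doc(doc_id, text):
    """Map step: emit one (term, doc_id) pair per word."""
    return [(word, doc_id) for word in text.lower().split()]

def reduce_term(term, doc_ids):
    """Reduce step: collapse all ids seen for a term into a sorted posting list."""
    return term, sorted(set(doc_ids))

def run_job(partitions):
    """Run map over every partition, group pairs by term, then reduce.

    On a real cluster the framework would do the grouping and place
    each partition on a different machine; here it is a local loop.
    """
    pairs = chain.from_iterable(
        map_doc(doc_id, text)
        for part in partitions
        for doc_id, text in part
    )
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return dict(reduce_term(term, ids) for term, ids in grouped.items())

# Hypothetical sample documents.
docs = [(1, "hadoop scales out"), (2, "solr searches"), (3, "hadoop feeds solr")]

one_machine = run_job([docs])                   # all data in one partition
three_machines = run_job([[d] for d in docs])   # one document per partition
```

The map and reduce functions are identical in both runs; only the partitioning of the input changes, and the resulting index is the same.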
Why Hadoop + Solr? (continued)
• Highly fault-tolerant
• Suitable for applications with large data sets
• A web browser can be used to browse the files of an HDFS instance.
• Detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
Solr
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards-Based Open Interfaces: XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Linearly scalable, auto index replication, auto failover and recovery
• Near Real-time indexing
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
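The HTTP and JSON interfaces can be sketched by building a query URL and an update payload. Nothing is sent over the network here. The host, port, collection name `docs`, and field names are assumptions for illustration; the `select` and `update` handler paths follow Solr's usual layout.

```python
import json
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"   # assumed host and port
COLLECTION = "docs"                   # hypothetical collection name

def select_url(query, rows=10):
    """Build the URL for a search request that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{BASE}/{COLLECTION}/select?{urlencode(params)}"

def update_body(documents):
    """Serialize documents as a JSON array for the collection's update handler."""
    return json.dumps(documents)

url = select_url("title:hadoop", rows=5)
body = update_body([{"id": "1", "title": "Hadoop and Solr"}])  # hypothetical fields
```

A client could issue a GET on `url`, or POST `body` to the collection's `update` path with a `Content-Type: application/json` header.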
SolrCloud
• New in Solr 4.0
• Easier scaling
• Centralized config
• Fault tolerant indexing and querying
• Using Apache ZooKeeper as registry
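The role of a registry in fault-tolerant querying can be sketched as follows. The `Registry` class is an in-memory stand-in written for this example, not the ZooKeeper API, and the shard and node names are hypothetical.

```python
class Registry:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self.live = {}  # shard name -> list of live node addresses

    def register(self, shard, node):
        """Record that a node is serving a shard."""
        self.live.setdefault(shard, []).append(node)

    def unregister(self, shard, node):
        """Remove a node, for example after it stops responding."""
        self.live[shard].remove(node)

    def pick(self, shard):
        """Return a live node for the shard, or fail if none is left."""
        nodes = self.live.get(shard, [])
        if not nodes:
            raise LookupError(f"no live node for {shard}")
        return nodes[0]

registry = Registry()
registry.register("shard1", "solr-a:8983")
registry.register("shard1", "solr-b:8983")

before = registry.pick("shard1")            # first registered node
registry.unregister("shard1", "solr-a:8983")  # simulate a node failure
after = registry.pick("shard1")             # queries fail over to the survivor
```

Because clients look up nodes through the registry instead of a fixed address, a failed node only needs to disappear from the registry for queries to move to another replica.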
[Diagram: Solr cloud. Labels: master, slave (x3), Solr server (x3), ZooKeeper.]
Technology and Platform
Technology: Hadoop, Solr
Front End: Solr
Back End: Hadoop framework, Solr search server
Thank you
