SOLUTION TRACK
Finding the Needle in a Big Data Haystack
@EvaAndreasson, Innovator & Problem Solver
Cloudera
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/search-hadoop-data-hub
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Agenda
• Problem (Solving)
• Apache Solr + Apache Hadoop et al
• Real-world examples
• Q&A
Problem Solving
Does it Work?
Will You Get in Trouble?
Information Driven Problem Solving
• Ask a Question
• Find All Relevant Data to Serve the Question
• Process the Data to Answer the Question
Information Driven Businesses
Thousands of employees & lots of data
Difficult to access
Heterogeneous legacy IT infrastructure
Difficult to manage
Hard to scale
Silos of multi-structured data
Difficult to integrate
Holds copies of data
Problem: Finding the Data (Needle) Across (Hay) Silos
ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources
Data
Archives
EDWs, Marts, Search Servers, Document Stores, Storage
©2014 Cloudera, Inc. All Rights Reserved. Reproduction
or redistribution without written permission is prohibited.
Independent of audience, topic, or tool, data is accessible
Unified data storage, processing, management and security
Ingest all data, any type or any scale
Eliminate copies, simplify aggregation and correlation
Solution: The Enterprise Data Hub (EDH)
EDWs, Marts, Storage, Search Servers, Documents
ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources
EDH
Archives
Apache Hadoop et al at the Core
• Open source
• white box, best innovation at all times
• Flexible
• Scalable
• ingest, storage, processing
• Cost efficient
But an EDH also Needs…
• Security & audit
• Manageability, Visibility and Resource Control
• Open architecture
• Multi-workload support and optimization
Problem solved!
We can go home…?
Problem: Finding the Needle in a Big Data Haystack?
New Audiences, New Challenges
• Non-technical staff needs access to data
• Same data used in bigger processes
• Speed up manual introspection
• Technical staff needs to view / explore data
• View interim results
• Drill down into mission critical data
• Explore data to design models
• Cross-workload needs
• Combine structured and unstructured data
Solution: Everyone Knows Search!
Explore
Navigate
Correlate
Cloudera Search: How We Integrated Solr with Hadoop et al
Cloudera Search
Interactive search for Hadoop
• Full-text and faceted navigation
• Batch, near real-time, and on-demand indexing
Apache Solr integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated as part of the Hadoop ecosystem
Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
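As a rough illustration of what full-text search with faceted navigation over the standard Solr API can look like from a client, the sketch below builds a request URL for Solr's `/select` handler. The host, the collection name `logs`, and the field names `message`, `host`, and `severity` are assumptions made up for this example; only the `q`, `rows`, `wt`, `facet`, and `facet.field` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_faceted_query(base_url, collection, text, facet_fields, rows=10):
    """Return a Solr /select URL for a full-text query with field facets.

    base_url, collection, and the field names are hypothetical placeholders.
    """
    params = [
        ("q", text),          # full-text query string
        ("rows", str(rows)),  # number of hits to return
        ("wt", "json"),       # ask for a JSON response
        ("facet", "true"),    # turn on faceting
    ]
    # One facet.field parameter per field to facet on.
    params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


url = build_faceted_query(
    "http://localhost:8983",  # assumed Solr host and port
    "logs",                   # assumed collection name
    "message:timeout",        # assumed field and search term
    ["host", "severity"],     # assumed facet fields
)
print(url)
```

The facet counts returned for `host` and `severity` are what a UI would turn into the drill-down links used for navigation.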
Scalable and Robust Index Storage
HDFS
Lucene
Extraction Mapping
Solr
Zookeeper
SolrCloud
Querying API Indexing API
Solr and HDFS
• Scalable, cost-efficient
index storage
• Higher availability
• Search and process data
in one platform
Near Real Time Indexing at Ingest
Log File
Solr and Flume
• Data ingest at scale
• Flexible extraction and
mapping
• Indexing at data ingest
• Document-level ACL
HDFS
Flume
Agent
Indexer
Other
Log File
Flume
Agent
Indexer
Scalable Batch Indexing
Files
Files
SolrCloud
Cluster
HDFS
Solr and MapReduce
• Flexible, scalable batch indexing
• GOLIVE: Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
Index
shard
Index
shard
MR
Indexer
MR
Indexer
Scalable Batch Indexing
Mapper:
Parse input into
indexable document
Mapper:
Parse input into
indexable document
Mapper:
Parse input into
indexable document
Index
shard 1
Index
shard 2
Arbitrary reducing steps of indexing and merging
End-Reducer (shard 1):
Index document
End-Reducer (shard 2):
Index document
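The mapper and end-reducer flow above can be mimicked in a few lines of plain Python. This is a toy simulation, not the actual MapReduce indexer: the tab-separated input format, the `parse_record` and `shard_for` helpers, and the CRC32-based routing are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 2  # matches the two index shards in the diagram


def parse_record(line):
    """Mapper step: parse one raw input line into an indexable document."""
    doc_id, text = line.split("\t", 1)
    return {"id": doc_id, "text": text}


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Route a document to a shard with a stable hash of its id (assumed scheme)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shards(lines):
    """Run the map phase, then let one end-reducer per shard index its documents."""
    # Map + shuffle: group parsed documents by target shard.
    grouped = defaultdict(list)
    for line in lines:
        doc = parse_record(line)
        grouped[shard_for(doc["id"])].append(doc)

    # End-reducers: each builds a tiny inverted index (term -> set of doc ids).
    shards = {}
    for shard_id in range(NUM_SHARDS):
        index = defaultdict(set)
        for doc in grouped.get(shard_id, []):
            for term in doc["text"].lower().split():
                index[term].add(doc["id"])
        shards[shard_id] = dict(index)
    return shards


raw_lines = [
    "doc1\tdisk failure on node a",
    "doc2\tnetwork timeout on node b",
    "doc3\tdisk replaced on node a",
]
shards = build_shards(raw_lines)
for shard_id, index in sorted(shards.items()):
    print(shard_id, sorted(index))
```

Because each shard is built independently, the work parallelizes across reducers, and a query for a term has to consult every shard and merge the results.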
HBase
Secondary Indexes, in Real-Time
Interactive load
Replication
Listener(s)
(Lily)
Triggers on
updates
Solr server
Solr server
Solr server
Solr server
Solr server
HDFS
HBase
Solr and HBase
• Secondary Indexes made
easy and flexible
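To make the "listener triggers on updates" idea concrete, here is a minimal sketch in which a callback mirrors each replicated row change into a search index. It does not use the Lily indexer's real API: the event dictionaries, the `InMemoryIndex` stand-in for a Solr server, and the `family:qualifier` to field-name mapping are all hypothetical.

```python
class InMemoryIndex:
    """Stand-in for a Solr server: stores documents keyed by id."""

    def __init__(self):
        self.docs = {}

    def add(self, doc):
        # A later document with the same id replaces the earlier one.
        self.docs[doc["id"]] = doc

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


def on_row_event(index, event, indexed_columns):
    """Listener callback: mirror one replicated row change into the index."""
    if event["type"] == "delete":
        index.delete(event["row"])
        return
    doc = {"id": event["row"]}
    for column, value in event["cells"].items():
        if column in indexed_columns:
            # Map 'family:qualifier' to a flat field name.
            doc[column.replace(":", "_")] = value
    index.add(doc)


index = InMemoryIndex()
events = [
    {"type": "put", "row": "user42", "cells": {"info:name": "Ada", "info:blob": "..."}},
    {"type": "put", "row": "user43", "cells": {"info:name": "Grace"}},
    {"type": "delete", "row": "user43"},
]
for event in events:
    on_row_event(index, event, indexed_columns={"info:name"})
print(index.docs)
```

Only the columns listed in `indexed_columns` reach the index, which is what makes it a secondary index over selected fields rather than a full copy of the table.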
HBase
Secondary Indexes, or in Batch
Interactive load
Replication
Listener(s)
(Lily)
Triggers on
updates
Solr server
Solr server
Solr server
Solr server
Solr server
MR
Indexer
HDFS
MR
Indexer
HBase
Index
shard
MR
Indexer
Solr and HBase
• Real Time or Batch
Streamlined Extraction and Mapping
Morphlines
• Simple and flexible data
transformation
• Reusable across multiple
index (and other) workloads
syslog Flume
Agent
Solr sink
Command: readLine
Command: grok
Command: loadSolr
Solr
Event
Record
Record
Record
Document
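The readLine, grok, loadSolr chain above is a pipeline of small commands, each passing records to the next. The sketch below imitates that shape with Python generators. It is not Morphlines configuration syntax, and the simplified syslog regular expression, the function names, and the list used as a sink are assumptions for illustration only.

```python
import re

# Simplified syslog pattern; a stand-in for a real grok expression.
SYSLOG_PATTERN = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[^:\[\s]+)(?:\[\d+\])?:\s"
    r"(?P<message>.*)$"
)


def read_line(event_body):
    """readLine: split a raw event body into one record per non-empty line."""
    for line in event_body.splitlines():
        if line.strip():
            yield {"raw": line}


def grok(records):
    """grok: extract named fields; drop records that do not match."""
    for record in records:
        match = SYSLOG_PATTERN.match(record["raw"])
        if match:
            yield {**record, **match.groupdict()}


def load_solr(records, sink):
    """loadSolr: hand each finished record to the index (here, a plain list)."""
    for record in records:
        sink.append(record)


event = (
    "Feb 12 10:15:01 web01 sshd[311]: Accepted password for eva\n"
    "not a syslog line\n"
)
indexed = []
load_solr(grok(read_line(event)), indexed)
print(indexed)
```

Keeping each command independent is what lets the same chain be reused across ingest-time and batch indexing workloads.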
Security
Cloudera Search + Sentry
• Cluster level access control
• Index level access control
Simple, Customizable Search UI
Hue
• Simple UI
• Navigated, faceted drill-down
• Customizable display
• Full text search,
standard Solr API and
query language
Simplified Management
Cloudera Manager
• Install, configure, deploy
SolrCloud on the cluster
• Centralized management and
monitoring – cross workloads
• Unified resource management
and control
Data
End User Client App
(e.g. Hue)
Flume
HDFS
Raw, filtered, or
annotated data
SolrCloud Cluster(s)
Data to be
indexed
Indexed data
MapReduce Batch Indexing
GoLive updates
HBase
Cluster
Replication
Events to be
indexed
Data
Cloudera Manager
Search queries
Architecture Overview
Sentry Authorization
Some Real-World Examples
• Image processing and data correlation
• Quick result checks on new algorithms
• Correlate image and device logs or external data
• Image exploration as a service
• Expedite data modeling processes
• Free-text form matching
• Claims report correlation, to gather data likely to be similar
• Serve 360-degree client or patient records faster
• Fraud / pattern extraction
• Log management
• Real-time drill down
• Long term trending and capacity planning
• Anomaly detection over larger sets of data
The EDH - Information-Driven Problem Solving Made Easy!
Integration is Key
• Eliminate data moves or copies, break silos
• Create a truly active archive
• Serve non-technical audiences on the same platform
where advanced analytics workloads run
• Analyze and combine structured and unstructured
data
• Expedite exploration of various data types
• Find data and do something with it – where it is
stored
• Future-proof your data management system
Learn More
• Cloudera.com
• Read our blog
• Take our online training (or get Cloudera certified)
• Download whitepapers
• View webinars
• Talk to our customers
• Follow/contact me
• @EvaAndreasson

