SlideShare uma empresa Scribd logo
1 de 36
SOLUTION TRACK
Finding the Needle in a Big Data Haystack
@EvaAndreasson, Innovator & Problem Solver
Cloudera
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/search-hadoop-data-hub
Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Agenda
• Problem (Solving)
• Apache Solr + Apache Hadoop et al
• Real-world examples
• Q&A
Problem Solving
Does it
Work?
Will You
Get in
Trouble
Information Driven Problem Solving
• Ask a Question
• Find All Relevant Data to Serve the Question
• Process the Data to Answer the Question
Information Driven Businesses
Thousands
of employees &
lots of data
Difficult to access
Heterogeneous
legacy IT
Infrastructure
Difficult to manage
Hard to scale
Silos of multi-
structured data
Difficult to Integrate
Holds copies of data
Problem: Finding the Data (Needle) Across (Hay) Silos
ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources
Data
Archives
EDWs Marts SearchServers Document Stores Storage
©2014 Cloudera, Inc. All Rights Reserved. Reproduction
or redistribution without written permission is prohibited.
Independent of
audience, topic, or
tool, data is
accessible
Unified data storage,
processing,
management and
security
Ingest all data
any type or any scale
Eliminate copies,
simplify aggregation
and correlation
EDWs Marts Storage Search
Solution: The Enterprise Data Hub (EDH)
Servers Documents
ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources
EDH
Archives
©2014 Cloudera, Inc. All Rights Reserved. Reproduction
or redistribution without written permission is prohibited.
EDH
Apache Hadoop et al at the Core
• Open source
• white box, best innovation at all times
• Flexible
• Scalable
• ingest, storage, processing
• Cost efficient
But an EDH also Needs…
• Security & audit
• Manageability, Visibility and Resource Control
• Open architecture
• Multi-workload support and optimization
Problem solved!
We can go home…?
Problem: Finding the Needle in a Big Data Haystack?
New Audiences, New Challenges
• Non-technical staff needs access to data
• Same data used in bigger processes
• Speed up manual introspection
• Technical staff needs to view / explore data
• View interim results
• Drill down into mission critical data
• Explore data to design models
• Cross-workload needs
• Combine structured and unstructured data
Solution: Everyone Knows Search!
Explore
Navigate
Correlate
15
Cloudera Search: How We
Integrated Solr with Hadoop et al
Cloudera Search
Interactive search for Hadoop
• Full-text and faceted navigation
• Batch, near real-time, and on-demand indexing
16
Apache Solr integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated as part of the Hadoop ecosystem
Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
Scalable and Robust Index Storage
HDFS
Lucene
Extraction Mapping
Solr
Zookeeper
SolrCloud
Querying API Indexing API
17
Solr and HDFS
• Scalable, cost-efficient
index storage
• Higher availability
• Search and process data
in one platform
Near Real Time Indexing at Ingest
Log File
Solr and Flume
• Data ingest at scale
• Flexible extraction and
mapping
• Indexing at data ingest
• Document-level ACL
HDFS
Flume
Agent
Indexer
Other
Log File
Flume
Agent
Indexer
18
Scalable Batch Indexing
Files
Files
SolrCloud
Cluster
19
HDFS
Solr and MapReduce
• Flexible, scalable batch
indexing
• GOLIVE: Start serving new
indices with no downtime
• On-demand indexing, cost-
efficient re-indexing
Index
shard
Index
shard
MR
Indexer
MR
Indexer
Scalable Batch Indexing
20
Mapper:
Parse input into
indexable document
Mapper:
Parse input into
indexable document
Mapper:
Parse input into
indexable document
Index
shard 1
Index
shard 2
Arbitrary reducing steps of indexing and merging
End-Reducer (shard 1):
Index document
End-Reducer (shard 2):
Index document
HBase
Secondary Indexes, in Real-Time
interactiveload
Replication
Listener(s)
(Lily)
Triggers on
updates
Solr server
Solr server
Solr server
Solr server
Solr server
HDFS
HBase
Solr and HBase
• Secondary Indexes made
easy and flexible
HBase
Secondary Indexes, or in Batch
interactiveload
Replication
Listener(s)
(Lily)
Triggers on
updates
Solr server
Solr server
Solr server
Solr server
Solr server
MR
Indexer
HDFS
MR
Indexer
HBase
Index
shard
MR
Indexer
Solr and HBase
• Real Time or Batch
Streamlined Extraction and Mapping
Morphlines
• Simple and flexible data
transformation
• Reusable across multiple
index (and other) workloads
syslog Flume
Agent
Solr sink
Command: readLine
Command: grok
Command: loadSolr
Solr
Event
Record
Record
Record
Document
Security
Cloudera Search + Sentry
• Cluster level access control
• Index level access control
Simple, Customizable Search UI
Hue
• Simple UI
• Navigated, faceted drill
down
• Customizable display
• Full text search,
standard Solr API and
query language
Simplified Management
Cloudera Manager
• Install, configure, deploy
SolrCloud on the cluster
• Centralized management and
monitoring – cross workloads
• Unified resource management
and control
Data
End User Client App
(e.g. Hue)
Flume
HDFS
Raw, filtered, or
annotaded data
SolrCloud Cluster(s)
Data to be
indexed
Indexed data
MapReduce Batch Indexing
GoLive updates
HBase
Cluster
Replication
Events to be
indexed
Data
ClouderaManager
Search queries
Architecture Overview
Sentry Authorization
Some Real-World Examples
• Image processing and data correlation
• Quick result checks on new algorithms
• Correlate image and device logs or external data
• Image exploration as a service
• Expedite data modeling processes
• Free-text form matching
• Claims report correlation, to gather data likely to be similar
• Serve 360-client or patient records in a speedier way
• Fraud / pattern extraction
• Log management
• Real-time drill down
• Long term trending and capacity planning
• Anomaly detection over larger sets of data
The EDH - Information-Driven Problem Solving Made Easy!
Integration is Key
• Eliminate data moves or copies, break silos
• Create a truly active archive
• Serve non-technical audiences on the same platform
as where advanced analytics workloads run
• Analyze and combine structured and non-structured
data
• Expedite exploration of various data types
• Find data and do something with it – where it is
stored
• Future proof your data management system
Learn More
• Cloudera.com
• Read our blog
• Take our online training (or get Cloudera certified)
• Download whitepapers
• View webinars
• Talk to our customers
• Follow/contact me
• @EvaAndreasson
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/search-
hadoop-data-hub

Mais conteúdo relacionado

Mais de C4Media

Mais de C4Media (20)

High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

Finding the Needle in a Big Data Haystack

  • 1. SOLUTION TRACK Finding the Needle in a Big Data Haystack @EvaAndreasson, Innovator & Problem Solver Cloudera
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /search-hadoop-data-hub
  • 3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Agenda • Problem (Solving) • Apache Solr + Apache Hadoop et al • Real-world examples • Q&A
  • 6. Information Driven Problem Solving • Ask a Question • Find All Relevant Data to Serve the Question • Process the Data to Answer the Question
  • 8. Thousands of employees & lots of data Difficult to access Heterogeneous legacy IT Infrastructure Difficult to manage Hard to scale Silos of multi- structured data Difficult to Integrate Holds copies of data Problem: Finding the Data (Needle) Across (Hay) Silos ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources Data Archives EDWs Marts SearchServers Document Stores Storage ©2014 Cloudera, Inc. All Rights Reserved. Reproduction or redistribution without written permission is prohibited.
  • 9. Independent of audience, topic, or tool, data is accessible Unified data storage, processing, management and security Ingest all data any type or any scale Eliminate copies, simplify aggregation and correlation EDWs Marts Storage Search Solution: The Enterprise Data Hub (EDH) Servers Documents ERP, CRM, RDBMS, Machines Files, Images, Video, Logs, Clickstreams External Data Sources EDH Archives ©2014 Cloudera, Inc. All Rights Reserved. Reproduction or redistribution without written permission is prohibited. EDH
  • 10. Apache Hadoop et al at the Core • Open source • white box, best innovation at all times • Flexible • Scalable • ingest, storage, processing • Cost efficient
  • 11. But an EDH also Needs… • Security & audit • Manageability, Visibility and Resource Control • Open architecture • Multi-workload support and optimization
  • 12.
  • 13. Problem solved! We can go home…?
  • 14. Problem: Finding the Needle in a Big Data Haystack?
  • 15. New Audiences, New Challenges • Non-technical staff needs access to data • Same data used in bigger processes • Speed up manual introspection • Technical staff needs to view / explore data • View interim results • Drill down into mission critical data • Explore data to design models • Cross-workload needs • Combine structured and unstructured data
  • 16. Solution: Everyone Knows Search! Explore Navigate Correlate
  • 17. 15 Cloudera Search: How We Integrated Solr with Hadoop et al
  • 18. Cloudera Search Interactive search for Hadoop • Full-text and faceted navigation • Batch, near real-time, and on-demand indexing 16 Apache Solr integrated with CDH • Established, mature search with vibrant community • Separate runtime like MapReduce, Impala • Incorporated as part of the Hadoop ecosystem Open Source • 100% Apache, 100% Solr • Standard Solr APIs
  • 19. Scalable and Robust Index Storage HDFS Lucene Extraction Mapping Solr Zookeeper SolrCloud Querying API Indexing API 17 Solr and HDFS • Scalable, cost-efficient index storage • Higher availability • Search and process data in one platform
  • 20. Near Real Time Indexing at Ingest Log File Solr and Flume • Data ingest at scale • Flexible extraction and mapping • Indexing at data ingest • Document-level ACL HDFS Flume Agent Indexer Other Log File Flume Agent Indexer 18
  • 21. Scalable Batch Indexing Files Files SolrCloud Cluster 19 HDFS Solr and MapReduce • Flexible, scalable batch indexing • GOLIVE: Start serving new indices with no downtime • On-demand indexing, cost- efficient re-indexing Index shard Index shard MR Indexer MR Indexer
  • 22. Scalable Batch Indexing 20 Mapper: Parse input into indexable document Mapper: Parse input into indexable document Mapper: Parse input into indexable document Index shard 1 Index shard 2 Arbitrary reducing steps of indexing and merging End-Reducer (shard 1): Index document End-Reducer (shard 2): Index document
  • 23. HBase Secondary Indexes, in Real-Time interactiveload Replication Listener(s) (Lily) Triggers on updates Solr server Solr server Solr server Solr server Solr server HDFS HBase Solr and HBase • Secondary Indexes made easy and flexible
  • 24. HBase Secondary Indexes, or in Batch interactiveload Replication Listener(s) (Lily) Triggers on updates Solr server Solr server Solr server Solr server Solr server MR Indexer HDFS MR Indexer HBase Index shard MR Indexer Solr and HBase • Real Time or Batch
  • 25. Streamlined Extraction and Mapping Morphlines • Simple and flexible data transformation • Reusable across multiple index (and other) workloads syslog Flume Agent Solr sink Command: readLine Command: grok Command: loadSolr Solr Event Record Record Record Document
  • 26. Security Cloudera Search + Sentry • Cluster level access control • Index level access control
  • 27. Simple, Customizable Search UI Hue • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language
  • 28. Simplified Management Cloudera Manager • Install, configure, deploy SolrCloud on the cluster • Centralized management and monitoring – cross workloads • Unified resource management and control
  • 29. Data End User Client App (e.g. Hue) Flume HDFS Raw, filtered, or annotaded data SolrCloud Cluster(s) Data to be indexed Indexed data MapReduce Batch Indexing GoLive updates HBase Cluster Replication Events to be indexed Data ClouderaManager Search queries Architecture Overview Sentry Authorization
  • 30. Some Real-World Examples • Image processing and data correlation • Quick result checks on new algorithms • Correlate image and device logs or external data • Image exploration as a service • Expedite data modeling processes • Free-text form matching • Claims report correlation, to gather data likely to be similar • Serve 360-client or patient records in a speedier way • Fraud / pattern extraction • Log management • Real-time drill down • Long term trending and capacity planning • Anomaly detection over larger sets of data
  • 31. The EDH - Information-Driven Problem Solving Made Easy!
  • 32. Integration is Key • Eliminate data moves or copies, break silos • Create a truly active archive • Serve non-technical audiences on the same platform as where advanced analytics workloads run • Analyze and combine structured and non-structured data • Expedite exploration of various data types • Find data and do something with it – where it is stored • Future proof your data management system
  • 33. Learn More • Cloudera.com • Read our blog • Take our online training (or get Cloudera certified) • Download whitepapers • View webinars • Talk to our customers • Follow/contact me • @EvaAndreasson
  • 34.
  • 35.
  • 36. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/search- hadoop-data-hub