1. The Search Is Over:
Integrating Solr and Hadoop to
Simplify Big Data Analytics
©MapR Technologies - Confidential 1
2. Evolution of Search
Documents
•Models
•Feature Selection
Content Relationships
•Page Rank, etc.
•Organization
•Social Graph
User Interaction
•Clicks
•Ratings/Reviews
•Learning to Rank
Queries
•Phrases
•NLP
4. Data is Growing Quickly
Business Analytics Requires a New Approach
Data Volume Growing 44x
– 2010: 1.2 Zettabytes
– 2020: 35.2 Zettabytes (IDC Digital Universe Study 2011)
Data is Growing Faster than Moore’s Law
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
5. MapReduce: A Paradigm Shift
Distributed computing platform
– Large clusters
– Commodity hardware
Pioneered at Google
– Bigtable and Google File System
Commercially available as Hadoop
7. How does Map/Reduce work?
1. Map
– Spread data across servers based on key/value pairs
– Each node independently scans local data
2. Servers produce Map results
3. Reduce – combine/merge Map results
4. Process complete, or Map a new function
Like shuffling multiple decks of playing cards
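The four steps above can be sketched in miniature. This toy word count in plain Python is illustrative only; it stands in for a real distributed Hadoop job, with function names invented for this sketch:

```python
from collections import defaultdict

def map_phase(documents):
    """Each 'node' scans its local data and emits key/value pairs."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key -- like shuffling multiple decks of cards."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine/merge the Map results per key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big cluster", "data data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'cluster': 1}
```

In a real cluster the shuffle moves data between machines; here it is just an in-memory grouping.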
8. The Cost of Enterprise Storage
             SAN Storage        NAS Filers         Local Storage
Cost         $2 - $10/Gigabyte  $1 - $5/Gigabyte   $0.05/Gigabyte
$1M gets     0.5 Petabytes      1 Petabyte         20 Petabytes
             200,000 IOPS       400,000 IOPS       10,000,000 IOPS
             1 GByte/sec        2 GByte/sec        800 GBytes/sec
9. Deep Object Store
Billions and Billions of Files
For some use cases it’s not the storage capacity, it’s the number of objects
– Messages
– Attachments
– Images
– Recordings
Provides a deep storage pool that is analytics-ready
– Store it until you need it
– Derive secondary value from analytic processing
Makes more sense to perform analytics on the data and send results over the network
10. Problems with Integrating Solr with Hadoop
Simple to integrate with Hadoop as a data source
Difficult to integrate distributed search at scale
SolrCloud simplifies shard and replication coordination
Integration limitations based on capabilities of large-scale storage
– High availability
– Data protection
– Ease of access
11. Sharded Text Indexing
Assign documents to shards; index text to local disk and then copy the index to the distributed file store
Copy to local disk is typically required before the index can be loaded
[Diagram: input documents → Map → Reducer → clustered index storage → local disk → search engine]
12. Problems with Solr and Hadoop
Failure of a reducer causes garbage to accumulate in the local disk
Failure of the search engine requires another download of the index from the clustered storage
[Diagram: input documents → Map → Reducer → clustered index storage → local disk → search engine]
13. Limitations of HDFS
HDFS is append-only
Data access is through the HDFS API
High availability is a challenge
– Single points of failure
– Performance bottleneck
Limited to 50-200 million files
[Diagram: NameNode, with metadata copies A and B on a NAS appliance, fronting a grid of DataNodes]
14. Logs: Flume Aggregates Incoming Events to Solr –
Requires a Multi-Step, Batch Process
[Diagram: multiple application servers feed events through Flume into the Hadoop cluster]
15. What’s Required for SDA (Search, Discovery, and Analytics)?
Ease of Data Access through Open Standards
Large Scale, Reliable Storage
Ease of Integration
– Management (REST)
– Security (LDAP, NIS, Linux PAM…)
– Analytics (NFS, ODBC, HDFS)
[Diagram: Search / Analytics / Discovery loop]
16. Ease of Data Access
[Diagram: cluster data reachable both through the HDFS API and through enterprise NFS access]
17. Multiple Architectures Possible
Export to the world
– NFS gateway runs on selected gateway hosts
Local server
– NFS gateway runs on local host
– Enables local compression and checksumming
Export to self
– NFS gateway runs on all data nodes, mounted from localhost
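As a sketch of the "export to self" pattern: each node mounts its own NFS gateway from localhost, so cluster data appears as an ordinary filesystem. The export path and mount point below are hypothetical and deployment-specific; treat this as a config fragment, not exact MapR syntax.

```shell
# Hypothetical example: mount the node's own NFS gateway from localhost.
# Export path (/mapr) and mount point (/mnt/cluster) are illustrative only.
sudo mkdir -p /mnt/cluster
sudo mount -t nfs -o nolock localhost:/mapr /mnt/cluster

# Equivalent /etc/fstab entry for mounting at boot:
# localhost:/mapr  /mnt/cluster  nfs  nolock  0  0

# Cluster files are now visible to ordinary POSIX tools:
ls /mnt/cluster
```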
18. Data Access through Standard Protocols
[Diagram: an NFS client connecting to multiple NFS servers running in the cluster]
19. NFS Access through a Local Server
[Diagram: the application and NFS server run together on the client host, which connects to the cluster nodes]
20. Universal Export to Self
[Diagram: a cluster node runs both a task and an NFS server, mounted from itself]
21. Nodes are Identical
[Diagram: every cluster node runs both a task and an NFS server]
22. Simplifies Solr Hadoop Integration
Failure of a reducer is cleaned up by the map-reduce framework
The search engine reads the mirrored index directly
[Diagram: input documents → Map → Reducer → clustered index storage → search engines]
23. How Does this Integration Happen?
Elegantly simple
Direct integration is a result of leveraging the architectures
Data in the Hadoop cluster is written to a Volume
The Solr crawler discovers content being entered into Hadoop
It accesses the data in the cluster through NFS
It builds the search index
Users access Solr to find data directly in Hadoop
24. Distributed Shard Indexing
[Diagram: Input → Map → Combine → Shuffle and sort → Reduce → Output. Documents (doc1, doc2, doc3, …) are mapped to (shard, doc) pairs such as (shard#1, doc1) and (shard#2, doc2); the shuffle groups them per shard, e.g. shard#1: [doc3, doc1], shard#2: [doc2], shard#3: [doc5]; each reducer writes one index (index/s1, index/s2, index/s3)]
25. How Does this Work at Scale with Distributed Indices?
MapReduce jobs analyze distributed, disparate data in a cluster
In distributed indexing, the input is split arbitrarily into chunks and each chunk is handled separately. There can be many more chunks than there are shards to be created.
Mapper assigns each document to a shard
– Shard is usually a hash of the document id
Reducer indexes all documents for a shard
– Indexes are created on local disk
– On success, the index is copied to the DFS
ZooKeeper is used to manage Solr instances
A large Solr search is distributed across multiple shards
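The mapper/reducer split described above can be sketched as follows. The hashing scheme is the generic "hash of document id" the slide mentions, not Solr's or MapR's exact implementation, and the function names are invented for this sketch:

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 3

def shard_of(doc_id):
    """Mapper side: assign a document to a shard by hashing its id."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def group_by_shard(doc_ids):
    """Shuffle side: group documents by shard, so each reducer
    receives all documents for one shard and builds one index."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_of(doc_id)].append(doc_id)
    return shards

shards = group_by_shard([f"doc{i}" for i in range(10)])
# Every document lands in exactly one of NUM_SHARDS shards:
assert sum(len(docs) for docs in shards.values()) == 10
```

Because the hash is deterministic, any mapper, on any chunk, sends a given document to the same reducer.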
26. What about HA and Data Protection?
Cluster Capabilities can Extend to Integrated Search and Discovery
Reliable Compute
– Automated re-replication
– Self-healing from HW and SW failures
– Load balancing
– Rolling upgrades
– No lost jobs or data
– 99999’s of uptime
Dependable Storage
– Business continuity with snapshots and mirrors
– Recover to a point in time
– End-to-end checksumming
– Strong consistency
– Mirror across sites to meet Recovery Time Objectives
27. MapReduce Failure to Write the Index
Highly available JobTracker and TaskTracker ensure that any failures are recovered, with state, to completion
MapReduce will clean up partially written indexes
No administrator intervention required
28. Solr Node Fails
Other Solr nodes start serving the shards that were being served by the failed node
29. Node Containing the Index Fails
Data is already replicated across the cluster
ZooKeeper assigns the Solr instance on the replicated node to the replicated shard
30. Additional High Availability and Replication
Snapshots are available
– Administrator sets the frequency at the Volume level
Snapshots with automatic de-duplication
– Saves space by sharing blocks
Redirect on write: fast, with no performance or storage penalty
– Zero performance loss on writing to the original
Scheduled, or on-demand
Easy recovery with drag and drop
31. Mirroring Support in Hadoop Cluster
Business Continuity and Efficiency
Efficient design
– Differential deltas are updated
– Compressed and check-summed
Easy to manage
– Scheduled or on-demand
– Consistent point-in-time
[Diagram: Production cluster in Datacenter 1 mirrored over the WAN to a Research cluster in Datacenter 2 and to EC2; remote seeding over the WAN]
32. Simplified NFS Data Flows for Distributed Search
Mirroring allows exact placement of index data
Arbitrary levels of replication are also possible
[Diagram: input documents → Map → Reducer → mirrors → search engines]
33. Improving Search Relevancy
Requires a continuous feedback loop
– The quality of the search is influenced by the end-user selections
– Fully automated process that improves with use
– Does not require manual tags or classification
[Diagram: Search / Analytics / Discovery loop]
34. Recommendations
Often referred to as collaborative filtering
Actors interact with items
– observe successful interaction
We want to suggest additional successful interactions
Observations inherently very sparse
35. Examples
Customers buying books (Linden et al)
Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s
36. Examples
Query for Friends results in links to Seinfeld
Search for kittens, get results for baby otters
37. Dyadic Structure
Functional
– Interaction: actor -> item*
Relational
– Interaction ⊆ Actors x Items
Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
Predict missing observations
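The matrix view above can be built directly from a log of (actor, item) observations. This toy construction uses a plain Counter in place of a sparse matrix library; the actors and items are invented for illustration:

```python
from collections import Counter

# Observed interactions: (actor, item) pairs, possibly repeated.
observations = [
    ("alice", "item1"), ("alice", "item2"),
    ("bob", "item1"), ("alice", "item1"),
]

# Rows indexed by actor, columns by item; value is the interaction count.
matrix = Counter(observations)

print(matrix[("alice", "item1")])  # 2
print(matrix[("bob", "item2")])    # 0 -- a missing observation to predict
```

Most cells are zero, which is why real systems store this matrix sparsely.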
38. Fundamental Algorithmics
Co-occurrence
– A is actors x items; K = A’A is items x items
– The product has the general shape of a matrix
– K tells us “users who interacted with x also interacted with y”
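A minimal co-occurrence computation, assuming binary interactions: with A as the actors x items matrix, each entry of K = A’A counts, for an item pair (x, y), how many actors interacted with both. The data below is invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Actor -> set of items the actor interacted with (the rows of A).
interactions = {
    "u1": {"x", "y"},
    "u2": {"x", "y", "z"},
    "u3": {"y", "z"},
}

def cooccurrence(interactions):
    """K[x][y] = number of actors who interacted with both x and y."""
    K = defaultdict(lambda: defaultdict(int))
    for items in interactions.values():
        for a, b in combinations(sorted(items), 2):
            K[a][b] += 1
            K[b][a] += 1
    return K

K = cooccurrence(interactions)
print(K["x"]["y"])  # 2 -- u1 and u2 interacted with both x and y
```

K is symmetric by construction, matching A’A for binary A.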
39. Why not Expand it?
Users enter queries (A)
– (actor = user, item=query)
Users view videos (B)
– (actor = user, item=video)
A’A gives query recommendation
– “did you mean to ask for”
B’B gives video recommendation
– “you might like these videos”
40. The punch-line
B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
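The cross-product B’A can be sketched the same way: given per-user queries (A) and per-user video views (B), each entry of B’A counts how many users both issued a query and watched a video. Users, queries, and videos here are invented for illustration:

```python
from collections import defaultdict

# A: user -> queries issued; B: user -> videos watched.
queries = {"u1": {"flamenco"}, "u2": {"flamenco", "guitar"}, "u3": {"guitar"}}
videos  = {"u1": {"v_paco"},   "u2": {"v_paco", "v_halen"}, "u3": {"v_halen"}}

def cross_occurrence(queries, videos):
    """C[query][video] = number of users who issued the query
    and also watched the video (the entries of B'A)."""
    C = defaultdict(lambda: defaultdict(int))
    for user, qs in queries.items():
        for q in qs:
            for v in videos.get(user, ()):
                C[q][v] += 1
    return C

C = cross_occurrence(queries, videos)
# Recommend videos for a query, most strongly linked first:
ranked = sorted(C["flamenco"].items(), key=lambda kv: -kv[1])
print(ranked)  # [('v_paco', 2), ('v_halen', 1)]
```

Note that no content or meta-data is consulted: the ranking comes purely from behavior, which is exactly the punch-line above.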
41. Real-life example
Query: “Paco de Lucia”
Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
43. The Search for Relevancy
Updating Search to Reflect Relevancy
– Big MapReduce jobs can use behavioral traces in logs to improve results and identify importance
The power of this virtuous loop depends on frictionless data access, high availability, and performance
[Diagram: Search / Analytics / Discovery loop]