Scalable vertical search engine with hadoop
1. Hadoop use case: A scalable
vertical search engine
Iván de Prado Alonso, Datasalt Co-founder
Twitter: @ivanprado
2. Content
§ The problem
§ The obvious solution
§ When the obvious solution fails…
§ … Hadoop comes to the rescue
§ Advantages & disadvantages
§ Improvements
3. What is a vertical search
engine?
[Diagram: Provider 1 and Provider 2 are each connected to the Vertical Search Engine by arrows labeled "Feed" and "Searches".]
5. The “obvious” architecture
The first thing that comes to your mind
[Diagram: Feed -> Download & Process -> check "Does it exist? Has it changed?" -> insert/update into the Database and the Lucene/Solr Index, which serves the Search Page.]
6. How it works
§ Feed download
§ For every record in the feed
• Check for existence in the DB
• If it exists and has changed, update
ª The DB
ª The Index
• If it doesn’t exist, insert into
ª The DB
ª The Index
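The loop above can be sketched as follows. This is a minimal in-memory illustration, not the original system: the dicts `db` and `index` stand in for the database and the Lucene/Solr index, and the `sync_feed` name and the `id` field are assumptions made for the example.

```python
def sync_feed(records, db, index):
    """Apply one feed to the DB and the index, record by record.

    `db` and `index` are plain dicts standing in for the real database
    and the Lucene/Solr index (an assumption made for illustration).
    Returns the number of inserted and updated records.
    """
    inserted = updated = 0
    for record in records:
        key = record["id"]          # hypothetical unique-id field
        existing = db.get(key)      # one lookup per record
        if existing is None:
            # The record does not exist yet: insert into DB and index.
            db[key] = record
            index[key] = record
            inserted += 1
        elif existing != record:
            # The record exists and has changed: update DB and index.
            db[key] = record
            index[key] = record
            updated += 1
    return inserted, updated
```

Note that every record costs one lookup, and every change costs one write to each store, regardless of how large the feed is.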
7. How it works (II)
§ The Database is used for
• Checking for record existence (avoiding
duplicates)
• Managing the data with the convenience of SQL
§ Lucene/Solr is used for
• Quick searches
• Searching by structured fields
• Free-text searches
• Faceting
10. “Swiss army knife of the 21st
century”
Media Guardian Innovation Awards
http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
11. Hadoop
“The Apache Hadoop
software library is a
framework that allows for
the distributed processing
of large data sets across
clusters of computers using
a simple programming
model”
From Hadoop homepage
12. File System
§ Distributed File System (HDFS)
• Cluster of nodes exposing their storage
capacity
• Big blocks: 64 MB
• Fault tolerant (replication)
• Big files storage
13. MapReduce
§ Two functions (Map and Reduce)
• Map(k, v) : [z,w]*
• Reduce(z, w*) : [u, v]*
§ Example: word count
• Map([document, null]) -> [word, 1]*
• Reduce(word, 1*) -> [word, total]
§ MapReduce & SQL
• SELECT word, count(*) GROUP BY word
§ Distributed execution on a cluster
§ Horizontal scalability
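The word count example can be simulated in memory to show how map, grouping by key and reduce fit together. This is only a single-process sketch, not the Hadoop API; the names `map_words`, `reduce_counts` and `run_mapreduce` are made up for the illustration.

```python
from collections import defaultdict


def map_words(document, _value=None):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]


def reduce_counts(word, ones):
    """Reduce: sum the 1s emitted for a single word."""
    return [(word, sum(ones))]


def run_mapreduce(inputs, mapper, reducer):
    """Run map, group the output by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results
```

For example, `run_mapreduce([("a b a", None)], map_words, reduce_counts)` groups the two `("a", 1)` pairs together before reducing them. On a cluster, the grouping step is what allows each key to be reduced on a different node.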
15. Because…
§ Hadoop is not a Database
§ Hadoop “apparently” only
processes data
§ Hadoop does not allow “lookups”
Hadoop is a paradigm shift that is difficult to
assimilate
17. Philosophy
§ Always reprocess everything. EVERYTHING!
§ Why?
• More bug tolerant
• More flexible
• More efficient. E.g.:
ª With a 7200 RPM HD
– Random IOPS: 100
– Sequential read/write: 40 MB/s
– Hypothesis: 5 KB record size
ª … it is faster to rewrite all data than to perform random updates when
more than 1.25% of the records have changed.
– 1 GB, 200,000 records
» Sequential writing: 25 s
» Random writing: 33 min!
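The figures above can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes, 5 KB = 5,000 bytes) and that every changed record costs exactly one random I/O operation; the function name is hypothetical.

```python
def rewrite_vs_random(total_bytes, record_bytes, seq_bytes_per_s, random_iops):
    """Compare a full sequential rewrite with per-record random updates.

    Assumes one random I/O per changed record (a simplification).
    Returns (records, sequential_seconds, random_seconds_for_all,
    break_even_fraction).
    """
    records = total_bytes // record_bytes
    # Time to rewrite the whole data set sequentially.
    sequential_s = total_bytes / seq_bytes_per_s
    # Time to touch every record with random writes.
    random_all_s = records / random_iops
    # Fraction of changed records at which both approaches take equally long.
    break_even = (sequential_s * random_iops) / records
    return records, sequential_s, random_all_s, break_even


records, seq_s, rand_s, break_even = rewrite_vs_random(
    total_bytes=1_000_000_000,    # 1 GB
    record_bytes=5_000,           # 5 KB per record
    seq_bytes_per_s=40_000_000,   # 40 MB/s
    random_iops=100,
)
```

With these inputs there are 200,000 records, the sequential rewrite takes 25 s, random writes for all records take 2,000 s (about 33 minutes), and the break-even point is 2,500 changed records, that is, 1.25% of the total.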
18. Fetcher
Feeds are downloaded and stored in HDFS.
§ MapReduce
• Input: [feed_url, null]*
• Mapper: identity
• Reducer(feed_url, null*)
ª Download the feed_url and store it
in an HDFS folder
[Diagram: several Reducer Tasks writing to HDFS.]
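A rough sketch of the reducer side of the fetcher follows. It is not Hadoop code: a local directory stands in for the HDFS folder, `download` is any callable that returns the feed body as bytes, and naming the output file after a hash of the URL is an assumption made for the example.

```python
import hashlib
import os


def fetch_reducer(feed_url, _nulls, download, output_dir):
    """Reducer: download one feed and store it under output_dir.

    `download` is an injected callable (url -> bytes) so the sketch does
    not depend on any particular HTTP library; `output_dir` is a local
    directory standing in for the HDFS folder. Both are assumptions.
    """
    body = download(feed_url)
    # Derive a file name that is stable and safe for any URL.
    name = hashlib.sha1(feed_url.encode("utf-8")).hexdigest() + ".feed"
    path = os.path.join(output_dir, name)
    with open(path, "wb") as handle:
        handle.write(body)
    return path
```

Because the mapper is the identity, the work is spread simply by how the feed URLs are distributed among the reducer tasks.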
19. Processor
Feeds are parsed, converted into documents and
deduplicated
§ MapReduce
• Input: [feed_path, null]*
• Map(feed_path, null) : [id, document]*
ª The feed is parsed and converted into documents
• Reducer(id, [document]*): [id, document]
ª Receives a list of documents and keeps the most
recent one (deduplication)
ª A unique and global identifier is required
(idProvider + idInternal)
• Output: [id, document]*
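The deduplication step can be sketched in memory as below. The field names `provider`, `internal_id` and `timestamp` are assumptions: the slides only say that the global identifier combines idProvider and idInternal and that the most recent document is kept.

```python
from collections import defaultdict


def global_id(provider_id, internal_id):
    """Build a unique, global identifier (idProvider + idInternal)."""
    return f"{provider_id}:{internal_id}"


def dedup_reducer(doc_id, documents):
    """Reducer: keep only the most recent document for one id."""
    newest = max(documents, key=lambda doc: doc["timestamp"])
    return [(doc_id, newest)]


def process(parsed_documents):
    """Group parsed documents by global id, then deduplicate each group."""
    groups = defaultdict(list)
    for doc in parsed_documents:
        groups[global_id(doc["provider"], doc["internal_id"])].append(doc)
    output = []
    for doc_id in sorted(groups):
        output.extend(dedup_reducer(doc_id, groups[doc_id]))
    return output
```

Two providers may reuse the same internal id, which is why the provider id has to be part of the key.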
20. Processor (II)
§ Possible problem:
• Very large feeds
ª Does not scale, as one task will deal with the
full feed.
§ Solution
• Write a custom InputFormat that divides
the feed into smaller pieces.
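The idea behind such an InputFormat can be illustrated outside Hadoop. The sketch below assumes a feed with one record per line, which is a simplification (an XML feed would need a different record boundary): the file is cut into byte ranges, and each range is read so that every line is processed by exactly one task.

```python
def compute_splits(file_size, split_size):
    """Return (start, end) byte ranges that cover the whole file."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def read_split(path, start, end):
    """Return the lines that begin inside the byte range [start, end).

    A line that straddles `end` belongs to this split; a line that
    straddles `start` belongs to the previous split.
    """
    lines = []
    with open(path, "rb") as handle:
        if start > 0:
            # Step back one byte and finish the line in progress, so we
            # land on the first line that starts at or after `start`.
            handle.seek(start - 1)
            handle.readline()
        while handle.tell() < end:
            line = handle.readline()
            if not line:
                break
            lines.append(line.rstrip(b"\n"))
    return lines
```

Each (start, end) pair could then be handed to a separate map task, so a very large feed no longer has to be handled by a single task.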
22. Indexer
[Diagram: three Reducer Tasks each build one index shard (Shard 1 to Shard 3); each shard is hot-swapped into the production Solr web servers.]
23. Indexer (II)
§ SOLR-1301
• https://issues.apache.org/jira/browse/SOLR-1301
• SolrOutputFormat
• 1 index per reducer
• A custom Partitioner can be used to control where to
place each document
§ Another option
• Writing your own indexing code
ª By creating a custom output format
ª By indexing at the reducer level. In each reduce call:
– Open an index
– Write all incoming records
– Close the index
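How a partitioner decides where each document goes can be sketched as follows. This is not the SolrOutputFormat or Hadoop Partitioner API: dicts stand in for the per-reducer indexes, and hashing the document id with CRC32 is just one possible placement rule, chosen here because it gives the same result on every run.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shards(documents, num_shards):
    """Group (id, document) pairs into one index per reducer.

    Each dict stands in for the index that a single reducer would
    open, fill with all its incoming records, and close.
    """
    shards = [dict() for _ in range(num_shards)]
    for doc_id, doc in documents:
        shards[shard_for(doc_id, num_shards)][doc_id] = doc
    return shards
```

A custom rule (for example, by country) would replace `shard_for` while leaving the rest unchanged.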
24. Search & Partitioning
§ Different partitioning schemes
• Horizontal
ª Each search involves all shards
• Vertical: by ad type, country, etc.
ª Searches can be restricted to the involved shard
§ Solr for index serving. Possibilities:
ª Non-federated Solr
– Only for vertical partitioning
ª Distributed Solr
ª Solr Cloud
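The difference between the two partitioning schemes shows up at query time. In this toy sketch, lists of dicts stand in for the shards and a Python predicate stands in for the query; neither is a Solr API.

```python
def search_horizontal(shards, predicate):
    """Horizontal partitioning: query every shard and merge the hits."""
    hits = []
    for shard in shards:
        hits.extend(doc for doc in shard if predicate(doc))
    return hits


def search_vertical(shards_by_key, key, predicate):
    """Vertical partitioning: query only the shard that owns `key`."""
    return [doc for doc in shards_by_key.get(key, []) if predicate(doc)]
```

With vertical partitioning by country, a search restricted to one country touches a single shard, which is why a non-federated setup is enough in that case.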
25. Reconciliation
[Diagram: input from the Fetcher and the last execution file -> Reconciliation -> reconciled documents -> next steps.]
§ How to record changes?
• Changes in price, features, etc.
• MapReduce:
ª Input: [id, document]*
– From last execution
– From current processing
ª Map: identity
ª Reduce(id, [document]*) : [id, document]
– Documents grouped by ID. New and old documents come together.
– New and old documents are compared.
– The relevant information is stored in the new document (e.g., the old price)
– Only the new document is emitted.
§ This is the closest thing in Hadoop to a DB
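The reconciliation job can be sketched in memory as follows. The "old"/"new" tags, the `price` and `old_price` fields, and the decision to drop documents that are missing from the current run are all assumptions made for the example.

```python
from collections import defaultdict


def reconcile_reducer(doc_id, tagged_documents):
    """Compare the old and new versions of one document.

    `tagged_documents` is a list of (source, document) pairs, where
    source is "old" (last execution) or "new" (current processing).
    """
    old = next((doc for src, doc in tagged_documents if src == "old"), None)
    new = next((doc for src, doc in tagged_documents if src == "new"), None)
    if new is None:
        return []  # not in the current run: dropped (an assumption)
    result = dict(new)
    if old is not None and old.get("price") != new.get("price"):
        result["old_price"] = old.get("price")  # keep the relevant history
    return [(doc_id, result)]  # only the new document is emitted


def reconcile(last_execution, current):
    """Group both inputs by id (identity map), then reduce each group."""
    groups = defaultdict(list)
    for doc_id, doc in last_execution:
        groups[doc_id].append(("old", doc))
    for doc_id, doc in current:
        groups[doc_id].append(("new", doc))
    output = []
    for doc_id in sorted(groups):
        output.extend(reconcile_reducer(doc_id, groups[doc_id]))
    return output
```

The output of one run becomes the "last execution" input of the next, which is what lets the job carry state forward without a database.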
26. Advantages of the architecture
§ Horizontal Scalability
• If properly programmed
§ High tolerance to failures and bugs
• Everything is always reprocessed
§ Flexible
• It is easy to make big changes
§ High decoupling
• Indexes are the only point of interaction between the
back-end and the front-end
• Web servers can keep running even if the back-end
is broken.
27. Disadvantages
§ Batch processing
• No real-time or “near” real-time
• Update cycles of hours
§ Completely different programming
paradigm
• Steep learning curve
28. Improvements
§ System for images
§ Fuzzy duplicate detection
§ Plasam:
• Mixing this architecture with a bypass system
that provides near-real-time updates to the FE
indexes
ª Implementing a bypass to the Solr servers
ª System for ensuring data consistency
– Without jumping back in time
• It combines the advantages of the proposed
architecture with near-real-time updates
• Datasalt has a prototype ready