5. FROM LEGACY TO BATCH
• Old architecture
• Why & when we changed
• Current architecture
• Hive, Pig & custom tools
• Migration process
6. OLD ARCHITECTURE
• Based on MySQL and PHP scripts
• Indexes created with DataImportHandler
[Diagram: incoming data → MySQL + PHP scripts → DataImportHandler → Lucene indexes]
7. WHEN & WHY WE MOVED
• Sharded strategies are hard to maintain
• We had 10M rows in a single table
• Many processes working on MySQL databases
• We wanted a more maintainable codebase
• The solution was pretty obvious...
8. CURRENT ARCHITECTURE
• Based on Hadoop
• Batch process that reprocesses all the ads...
• But needs to be aware of the previous execution!
• Hive & custom tools to know what is going on
9. CURRENT ARCHITECTURE
[Diagram: incoming data + external data → Hadoop cluster (Ad Processor → Diff → Matching → Expiration → Deduplication → Indexing, with the t-1 output feeding Diff) → Lucene indexes → deployment; Hive stats on the side]
10. AD PROCESSOR
[Diagram: incoming data → Ad Processor → Thrift objects]
• Converts text files to Thrift objects
• Checks that the ads are complete
• Searches for poison words
• Checks the value ranges
• Parses text (dates, currencies, etc.)
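The checks above can be sketched in plain Python. This is only an illustration of the kind of validation the Ad Processor does; the field names, poison words and price range are made up:

```python
POISON_WORDS = {"scam", "free money"}     # hypothetical blacklist
PRICE_RANGE = (1, 10_000_000)             # hypothetical valid range
REQUIRED_FIELDS = ("id", "title", "price")

def validate_ad(ad: dict) -> list:
    """Return a list of validation errors (empty list means the ad is OK)."""
    errors = []
    # Completeness check
    for field in REQUIRED_FIELDS:
        if ad.get(field) is None:
            errors.append(f"missing field: {field}")
    # Poison-word search
    text = ad.get("title", "").lower()
    for word in POISON_WORDS:
        if word in text:
            errors.append(f"poison word: {word}")
    # Value-range check
    price = ad.get("price")
    if price is not None and not (PRICE_RANGE[0] <= price <= PRICE_RANGE[1]):
        errors.append("price out of range")
    return errors

print(validate_ad({"id": "1", "title": "Used car", "price": 3500}))   # []
print(validate_ad({"id": "2", "title": "free money!", "price": 0}))
```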
11. DIFF PHASE
[Diagram: ads (t) + ads (t-1) → Diff → ads (t)]
• Performs the diff between executions
• Merges the ads of both executions
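A conceptual sketch of the Diff phase, assuming ads are keyed by id; the status tags are illustrative, not the real implementation:

```python
def diff(ads_t: dict, ads_t1: dict) -> dict:
    """Merge two executions, tagging each ad as new/updated/unchanged/gone."""
    merged = {}
    for ad_id, ad in ads_t.items():
        if ad_id not in ads_t1:
            merged[ad_id] = (ad, "new")
        elif ad != ads_t1[ad_id]:
            merged[ad_id] = (ad, "updated")
        else:
            merged[ad_id] = (ad, "unchanged")
    for ad_id, ad in ads_t1.items():
        if ad_id not in ads_t:
            merged[ad_id] = (ad, "gone")   # handled by later phases
    return merged

prev = {"a": {"price": 100}, "b": {"price": 200}}
curr = {"a": {"price": 100}, "b": {"price": 250}, "c": {"price": 10}}
for ad_id, (ad, status) in diff(curr, prev).items():
    print(ad_id, status)
```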
12. MATCHING PHASE
[Diagram: ads + external data → Matching → enriched ads]
• Extracts semantic information:
• Geographical information
• Car makes and models
• Companies
• ...
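A sketch of the Matching idea: enrich each ad with entities looked up in external data sets. The dictionaries below stand in for the real external data (gazetteer, car catalogue, etc.):

```python
CAR_MAKES = {"ford": ["fiesta", "focus"], "seat": ["ibiza", "leon"]}
CITIES = {"barcelona", "madrid"}

def match(ad: dict) -> dict:
    """Return a copy of the ad enriched with extracted entities."""
    words = ad["text"].lower().split()
    enriched = dict(ad)
    enriched["cities"] = [w for w in words if w in CITIES]
    for make, models in CAR_MAKES.items():
        if make in words:
            enriched["make"] = make
            enriched["models"] = [w for w in words if w in models]
    return enriched

print(match({"id": "1", "text": "Ford Fiesta for sale in Barcelona"}))
```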
13. EXPIRATION PHASE
[Diagram: ads → Expiration → ads to be indexed]
• Works as a filter
• Deletes:
• Expired ads
• Incorrect ads
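Since Expiration is a pure filter, it is easy to sketch; the 30-day window and the `invalid` flag are made up for the example:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)   # hypothetical expiration window

def keep(ad: dict, today: date) -> bool:
    if ad.get("invalid"):                      # marked incorrect upstream
        return False
    return today - ad["published"] <= MAX_AGE  # not yet expired

ads = [
    {"id": "a", "published": date(2013, 1, 1)},
    {"id": "b", "published": date(2013, 2, 1)},
    {"id": "c", "published": date(2013, 2, 1), "invalid": True},
]
to_index = [ad for ad in ads if keep(ad, today=date(2013, 2, 10))]
print([ad["id"] for ad in to_index])   # ['b']
```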
14. DEDUPLICATION PHASE
[Diagram: ads → Deduplication → deduplicated ads]
• Duplicates are a big issue for us
• You cannot compare N ads against each other
• Solution:
• Use heuristics to create “possible duplicates” groups
• Compare all the ads of each group
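The grouping trick can be sketched like this: a cheap heuristic key buckets "possible duplicates", and the expensive pairwise comparison runs only inside each small group instead of over all N ads. The key and the comparison below are made-up stand-ins:

```python
from collections import defaultdict
from itertools import combinations

def heuristic_key(ad: dict) -> tuple:
    # Hypothetical grouping key: same category and roughly the same price.
    return (ad["category"], round(ad["price"], -1))

def is_duplicate(a: dict, b: dict) -> bool:
    # Hypothetical expensive comparison, here just title equality.
    return a["title"].lower() == b["title"].lower()

def deduplicate(ads: list) -> list:
    groups = defaultdict(list)
    for ad in ads:
        groups[heuristic_key(ad)].append(ad)
    dupes = set()
    for group in groups.values():
        for a, b in combinations(group, 2):   # pairwise, but per group only
            if is_duplicate(a, b):
                dupes.add(b["id"])            # keep the first, drop the second
    return [ad for ad in ads if ad["id"] not in dupes]

ads = [
    {"id": 1, "category": "cars", "price": 5000, "title": "Ford Focus"},
    {"id": 2, "category": "cars", "price": 5002, "title": "ford focus"},
    {"id": 3, "category": "cars", "price": 9000, "title": "Seat Ibiza"},
]
print([ad["id"] for ad in deduplicate(ads)])   # [1, 3]
```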
15. INDEXING PHASE
[Diagram: ads → Indexing → Lucene indexes]
• Is actually done in two phases
• First we create micro indexes
• We use Embedded Solr Server
• Then we merge them
• Plain Lucene
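A toy illustration of the two-phase idea: build several small "micro indexes" independently (in the real pipeline, with Embedded Solr Server), then merge them into one (in the real pipeline, with plain Lucene). Here an index is just an in-memory inverted index:

```python
from collections import defaultdict

def build_micro_index(docs: dict) -> dict:
    """Phase 1: index one partition of documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge(indexes: list) -> dict:
    """Phase 2: merge the micro indexes into a single one."""
    merged = defaultdict(set)
    for index in indexes:
        for term, postings in index.items():
            merged[term] |= postings
    return merged

micro1 = build_micro_index({1: "ford focus", 2: "seat ibiza"})
micro2 = build_micro_index({3: "ford fiesta"})
full = merge([micro1, micro2])
print(sorted(full["ford"]))   # [1, 3]
```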
16. HIVE, PIG & CUSTOM TOOLS
• Critical:
• To know what is going on (control info)
• To debug
• To prototype new processes
• To understand your data (grep, cat)
• To create reports
17. MIGRATION PROCESS
• Used Amazon EC2 to test different cluster configurations
• Kept both systems running for one month
• Switched to the new system gradually, one country at a time
• Then we moved the cluster to our own servers
18. FROM BATCH TO NEAR REAL-TIME
• Batch is not enough
• Storm for real time data processing
• HBase for data storage
• Zookeeper for systems coordination
• Putting it all together
• Batch and NRT. Mixed architecture
19. BATCH IS NOT ENOUGH
• Data processing with MapReduce scales well, but takes time and has latency
• Crunching documents in batch means waiting until everything is processed
• We want to show the user fresher results!
20. BATCH IS NOT ENOUGH
• Storm + HBase + Zookeeper looks like a good feed!!!
[Diagram: feeds → spouts → bolts → bolts → Solr slaves; ZK coordinating the topology; MR pipeline over HDFS id tables]
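The spout → bolt flow in the diagram can be simulated in plain Python. This is only the data-flow model, not Storm's API; the feed format and stage names are invented:

```python
def feed_spout(feed):
    """Spout: emits one tuple per line read from the incoming feed."""
    for line in feed:
        yield {"raw": line}

def parse_bolt(tuples):
    """Bolt: turns raw lines into structured ads and passes them on."""
    for t in tuples:
        ad_id, title = t["raw"].split(",", 1)
        yield {"id": ad_id, "title": title}

def index_bolt(tuples, index):
    """Final bolt: 'indexes' each ad (here: stores it in a dict;
    in the real topology this stage writes to HBase / feeds Solr)."""
    for t in tuples:
        index[t["id"]] = t["title"]

index = {}
feed = ["1,Ford Focus", "2,Seat Ibiza"]
index_bolt(parse_bolt(feed_spout(feed)), index)
print(index)   # {'1': 'Ford Focus', '2': 'Seat Ibiza'}
```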
21. STORM - PROPERTIES
• Distributed real time computation system
• Fault tolerance
• Horizontal scalability
• Low latency
• Reliability
25. HBASE - PROPERTIES
• Distributed, sorted map datastore
• Automatic failover
• Rows are sorted
• Many columns per row
• Good Hadoop integration
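The "distributed, sorted map" model above can be pictured with plain dicts: rows are sorted by key, each row holds many columns, and sorted keys make range scans cheap (which is why row-key design matters). The row keys and column names are invented for the example:

```python
ads_table = {
    "es|car|001": {"d:title": "Seat Ibiza", "d:price": "4000"},
    "es|car|002": {"d:title": "Ford Focus", "d:price": "5000"},
    "es|flat|001": {"d:title": "Flat in Madrid"},
}

def scan(table, start, stop):
    """Range scan over the sorted row keys, like an HBase Scan."""
    for row_key in sorted(table):
        if start <= row_key < stop:
            yield row_key, table[row_key]

# Scan only the car ads by exploiting the sorted key prefix.
for row, cols in scan(ads_table, "es|car|", "es|car|~"):
    print(row, cols)
```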
26. HBASE - COMPONENTS
• Master
• Slave coordination and failure detection
• Admin features
• Region server (slaves)
27. ZOOKEEPER
• Highly available coordination system
• Used for locking, distributed configuration, leader election,
cluster management...
• Curator makes the common algorithms easy to implement
28. PUTTING IT ALL TOGETHER
[Diagram: feeds → spouts → processor bolts → indexer bolt → Solr slaves; ZK coordinating the topology; MR pipeline over HDFS id tables]
29. MIXED ARCHITECTURE
• If the number of segments in the index gets too big, it has an impact on search performance
• Building indexes in batch allows us to keep a small number of segments
• Gives near real-time updates and is tolerant to human error