5. FROM LEGACY TO BATCH
• Old architecture
• Why & when we changed
• Current architecture
• Hive, Pig & custom tools
• Migration process
6. OLD ARCHITECTURE
• Based on MySQL and PHP scripts
• Indexes created with DataImportHandler
[Diagram: incoming data → MySQL + PHP scripts → DataImportHandler → Lucene indexes]
7. WHEN & WHY WE MOVED
• Sharded strategies are hard to maintain
• We had 10M rows in a single table
• Many processes working on MySQL databases
• We wanted a more maintainable codebase
• The solution was pretty obvious...
8. CURRENT ARCHITECTURE
• Based on Hadoop
• Batch process that reprocesses all the ads...
• But needs to be aware of the previous execution!
• Hive & custom tools to know what is going on
9. CURRENT ARCHITECTURE
[Diagram: incoming data + external data → Hadoop cluster (Ad Processor → Diff → Matching → Expiration → Deduplication → Indexing, with the t-1 output feeding Diff) → Lucene indexes → deployment; Hive stats on the side]
10. AD PROCESSOR
[Diagram: incoming data → Ad Processor → Thrift objects]
• Converts text files to Thrift objects
• Checks that the ads are complete
• Searches for poison words
• Checks the value ranges
• Parses text (dates, currencies, etc.)
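The checks above can be sketched in plain Python. This is only an illustration of the kind of validation the Ad Processor does; the field names, poison words and price range are made up:

```python
POISON_WORDS = {"scam", "free money"}     # hypothetical blacklist
PRICE_RANGE = (1, 10_000_000)             # hypothetical valid range
REQUIRED_FIELDS = ("id", "title", "price")

def validate_ad(ad: dict) -> list:
    """Return a list of validation errors (empty list means the ad is OK)."""
    errors = []
    # Completeness check
    for field in REQUIRED_FIELDS:
        if ad.get(field) is None:
            errors.append(f"missing field: {field}")
    # Poison-word search
    text = ad.get("title", "").lower()
    for word in POISON_WORDS:
        if word in text:
            errors.append(f"poison word: {word}")
    # Value-range check
    price = ad.get("price")
    if price is not None and not (PRICE_RANGE[0] <= price <= PRICE_RANGE[1]):
        errors.append("price out of range")
    return errors

print(validate_ad({"id": "1", "title": "Used car", "price": 3500}))   # []
print(validate_ad({"id": "2", "title": "free money!", "price": 0}))
```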
11. DIFF PHASE
[Diagram: ads (t) + ads (t-1) → Diff → ads (t)]
• Performs the diff between executions
• Merges the ads of both executions
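A conceptual sketch of the Diff phase, assuming ads are keyed by id; the status tags are illustrative, not the real implementation:

```python
def diff(ads_t: dict, ads_t1: dict) -> dict:
    """Merge two executions, tagging each ad as new/updated/unchanged/gone."""
    merged = {}
    for ad_id, ad in ads_t.items():
        if ad_id not in ads_t1:
            merged[ad_id] = (ad, "new")
        elif ad != ads_t1[ad_id]:
            merged[ad_id] = (ad, "updated")
        else:
            merged[ad_id] = (ad, "unchanged")
    for ad_id, ad in ads_t1.items():
        if ad_id not in ads_t:
            merged[ad_id] = (ad, "gone")   # handled by later phases
    return merged

prev = {"a": {"price": 100}, "b": {"price": 200}}
curr = {"a": {"price": 100}, "b": {"price": 250}, "c": {"price": 10}}
for ad_id, (ad, status) in diff(curr, prev).items():
    print(ad_id, status)
```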
12. MATCHING PHASE
[Diagram: ads + external data → Matching → enriched ads]
• Extracts semantic information:
• Geographical information
• Car makes and models
• Companies
• ...
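A sketch of the Matching idea: enrich each ad with entities looked up in external data sets. The dictionaries below stand in for the real external data (gazetteer, car catalogue, etc.):

```python
CAR_MAKES = {"ford": ["fiesta", "focus"], "seat": ["ibiza", "leon"]}
CITIES = {"barcelona", "madrid"}

def match(ad: dict) -> dict:
    """Return a copy of the ad enriched with extracted entities."""
    words = ad["text"].lower().split()
    enriched = dict(ad)
    enriched["cities"] = [w for w in words if w in CITIES]
    for make, models in CAR_MAKES.items():
        if make in words:
            enriched["make"] = make
            enriched["models"] = [w for w in words if w in models]
    return enriched

print(match({"id": "1", "text": "Ford Fiesta for sale in Barcelona"}))
```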
13. EXPIRATION PHASE
[Diagram: ads → Expiration → ads to be indexed]
• Works as a filter
• Deletes:
• Expired ads
• Incorrect ads
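Since Expiration is a pure filter, it is easy to sketch; the 30-day window and the `invalid` flag are made up for the example:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)   # hypothetical expiration window

def keep(ad: dict, today: date) -> bool:
    if ad.get("invalid"):                      # marked incorrect upstream
        return False
    return today - ad["published"] <= MAX_AGE  # not yet expired

ads = [
    {"id": "a", "published": date(2013, 1, 1)},
    {"id": "b", "published": date(2013, 2, 1)},
    {"id": "c", "published": date(2013, 2, 1), "invalid": True},
]
to_index = [ad for ad in ads if keep(ad, today=date(2013, 2, 10))]
print([ad["id"] for ad in to_index])   # ['b']
```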
14. DEDUPLICATION PHASE
[Diagram: ads → Deduplication → deduplicated ads]
• Duplicates are a big issue for us
• You cannot compare N ads against each other
• Solution:
• Use heuristics to create “possible duplicates” groups
• Compare all the ads of each group
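The grouping trick can be sketched like this: a cheap heuristic key buckets "possible duplicates", and the expensive pairwise comparison runs only inside each small group instead of over all N ads. The key and the comparison below are made-up stand-ins:

```python
from collections import defaultdict
from itertools import combinations

def heuristic_key(ad: dict) -> tuple:
    # Hypothetical grouping key: same category and roughly the same price.
    return (ad["category"], round(ad["price"], -1))

def is_duplicate(a: dict, b: dict) -> bool:
    # Hypothetical expensive comparison, here just title equality.
    return a["title"].lower() == b["title"].lower()

def deduplicate(ads: list) -> list:
    groups = defaultdict(list)
    for ad in ads:
        groups[heuristic_key(ad)].append(ad)
    dupes = set()
    for group in groups.values():
        for a, b in combinations(group, 2):   # pairwise, but per group only
            if is_duplicate(a, b):
                dupes.add(b["id"])            # keep the first, drop the second
    return [ad for ad in ads if ad["id"] not in dupes]

ads = [
    {"id": 1, "category": "cars", "price": 5000, "title": "Ford Focus"},
    {"id": 2, "category": "cars", "price": 5002, "title": "ford focus"},
    {"id": 3, "category": "cars", "price": 9000, "title": "Seat Ibiza"},
]
print([ad["id"] for ad in deduplicate(ads)])   # [1, 3]
```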
15. INDEXING PHASE
[Diagram: ads → Indexing → Lucene indexes]
• Is actually done in two phases
• First we create micro indexes
• We use Embedded Solr Server
• Then we merge them
• Plain Lucene
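A toy illustration of the two-phase idea: build several small "micro indexes" independently (in the real pipeline, with Embedded Solr Server), then merge them into one (in the real pipeline, with plain Lucene). Here an index is just an in-memory inverted index:

```python
from collections import defaultdict

def build_micro_index(docs: dict) -> dict:
    """Phase 1: index one partition of documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge(indexes: list) -> dict:
    """Phase 2: merge the micro indexes into a single one."""
    merged = defaultdict(set)
    for index in indexes:
        for term, postings in index.items():
            merged[term] |= postings
    return merged

micro1 = build_micro_index({1: "ford focus", 2: "seat ibiza"})
micro2 = build_micro_index({3: "ford fiesta"})
full = merge([micro1, micro2])
print(sorted(full["ford"]))   # [1, 3]
```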
16. HIVE, PIG & CUSTOM TOOLS
• Critical:
• To know what is going on (control info)
• To debug
• To prototype new processes
• To understand your data (grep, cat)
• To create reports
17. MIGRATION PROCESS
• Used Amazon EC2 to test different cluster configurations
• Kept both systems running for one month
• Switched to the new system gradually, one country at a time
• Then we moved the cluster to our own servers
18. FROM BATCH TO NEAR REAL-TIME
• Batch is not enough
• Storm for real time data processing
• HBase for data storage
• Zookeeper for systems coordination
• Putting it all together
• Batch and NRT. Mixed architecture
19. BATCH IS NOT ENOUGH
• Data processing with MapReduce scales well, but takes time and has latency
• Crunching documents in batch means waiting until everything is processed
• We want to show the user fresher results!
20. BATCH IS NOT ENOUGH
• Storm + HBase + Zookeeper looks like a good feed!!!
[Diagram: feeds → spouts → bolts → bolts → Solr slaves; ZK coordinating the topology; MR pipeline over HDFS id tables]
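The spout → bolt flow in the diagram can be simulated in plain Python. This is only the data-flow model, not Storm's API; the feed format and stage names are invented:

```python
def feed_spout(feed):
    """Spout: emits one tuple per line read from the incoming feed."""
    for line in feed:
        yield {"raw": line}

def parse_bolt(tuples):
    """Bolt: turns raw lines into structured ads and passes them on."""
    for t in tuples:
        ad_id, title = t["raw"].split(",", 1)
        yield {"id": ad_id, "title": title}

def index_bolt(tuples, index):
    """Final bolt: 'indexes' each ad (here: stores it in a dict;
    in the real topology this stage writes to HBase / feeds Solr)."""
    for t in tuples:
        index[t["id"]] = t["title"]

index = {}
feed = ["1,Ford Focus", "2,Seat Ibiza"]
index_bolt(parse_bolt(feed_spout(feed)), index)
print(index)   # {'1': 'Ford Focus', '2': 'Seat Ibiza'}
```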
21. STORM - PROPERTIES
• Distributed real time computation system
• Fault tolerance
• Horizontal scalability
• Low latency
• Reliability
25. HBASE - PROPERTIES
• Distributed, sorted map datastore
• Automatic failover
• Rows are sorted
• Many columns per row
• Good Hadoop integration
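The "distributed, sorted map" model above can be pictured with plain dicts: rows are sorted by key, each row holds many columns, and sorted keys make range scans cheap (which is why row-key design matters). The row keys and column names are invented for the example:

```python
ads_table = {
    "es|car|001": {"d:title": "Seat Ibiza", "d:price": "4000"},
    "es|car|002": {"d:title": "Ford Focus", "d:price": "5000"},
    "es|flat|001": {"d:title": "Flat in Madrid"},
}

def scan(table, start, stop):
    """Range scan over the sorted row keys, like an HBase Scan."""
    for row_key in sorted(table):
        if start <= row_key < stop:
            yield row_key, table[row_key]

# Scan only the car ads by exploiting the sorted key prefix.
for row, cols in scan(ads_table, "es|car|", "es|car|~"):
    print(row, cols)
```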
26. HBASE - COMPONENTS
• Master
• Slave coordination and failure detection
• Admin features
• Region server (slaves)
27. ZOOKEEPER
• Highly available coordination system
• Used for locking, distributed configuration, leader election,
cluster management...
• Curator makes the common algorithms easy to implement
28. PUTTING IT ALL TOGETHER
[Diagram: feeds → spouts → processor bolts → indexer bolt → Solr slaves; ZK coordinating the topology; MR pipeline over HDFS id tables]
29. MIXED ARCHITECTURE
• If the number of segments in the index gets too big, it has an impact on search performance
• Building indexes in batch allows us to keep a small number of segments
• Gives near real-time updates and is tolerant to human error