FROM LEGACY, TO BATCH,
  TO NEAR REAL-TIME
      Marc Sturlese, Dani Solà
WHO ARE WE?

•   Marc Sturlese - @sturlese

    • Backend   engineer, focused on R&D

    • Interests: search, scalability

•   Dani Solà - @dani_sola

    • Backend   engineer

    • Interests: distributed   systems, data mining, search,...
TROVIT
Search engine for classifieds: 6 verticals, 38 countries & growing
FROM LEGACY TO BATCH

• Old   architecture

• Why    & when we changed

• Current     architecture

• Hive, Pig   & custom tools

• Migration    process
OLD ARCHITECTURE

• Based    on MySQL and PHP scripts

• Indexes     created with DataImportHandler


      Incoming data                  DataImportHandler




                                                         Lucene Indexes
                             MySQL

          PHP Scripts
WHEN & WHY WE MOVED

• Sharding strategies are hard to maintain

• We    had 10M rows in a single table

• Many   processes working on MySQL databases

• We    wanted a more maintainable codebase

• The   solution was pretty obvious...
CURRENT ARCHITECTURE


• Based   on Hadoop

• Batch process that reprocesses all the ads...

• But   needs to be aware of the previous execution!

• Hive   & custom tools to know what happens
CURRENT ARCHITECTURE

Incoming data          External Data                                Lucene Indexes
                                                                           Deployment




Ad Processor    Diff     Matching      Expiration   Deduplication     Indexing

                           t-1                           Hadoop Cluster

                              Hive Stats
AD PROCESSOR

Incoming data     • Converts    text files to Thrift objects

                  • Checks    that the ads are complete

                  • Searches for poison words
Ad Processor
                  • Checks    the value ranges

 Thrift           • Parses   text (dates, currencies, etc)
Objects
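
The checks listed on this slide can be sketched roughly as follows. This is only an illustration, not the actual Ad Processor: the `Ad` dataclass stands in for the Thrift object, and the field names, `POISON_WORDS`, `PRICE_RANGE` and `process_ad` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

# Hypothetical configuration: the real poison words and ranges are not in the slides.
POISON_WORDS = {"viagra", "casino"}
PRICE_RANGE = (1.0, 10_000_000.0)
REQUIRED_FIELDS = ("id", "title", "price", "date")


@dataclass
class Ad:
    """Stand-in for the Thrift object produced by the Ad Processor."""
    id: str
    title: str
    price: float
    published: date


def process_ad(raw: dict) -> Optional[Ad]:
    """Return a typed Ad, or None if the raw record fails any check."""
    # 1. Completeness: every required field must be present and non-empty.
    if any(not raw.get(field) for field in REQUIRED_FIELDS):
        return None
    # 2. Poison words: reject ads whose title contains a blacklisted word.
    if POISON_WORDS & set(raw["title"].lower().split()):
        return None
    # 3. Text parsing: turn the price and date strings into typed values.
    try:
        price = float(raw["price"].replace(",", ""))
        published = datetime.strptime(raw["date"], "%Y-%m-%d").date()
    except ValueError:
        return None
    # 4. Value ranges: the parsed price must fall inside the allowed range.
    if not PRICE_RANGE[0] <= price <= PRICE_RANGE[1]:
        return None
    return Ad(raw["id"], raw["title"], price, published)
```

A record that fails any step is simply dropped here; a real pipeline would probably also count the rejections per reason so they can be inspected later.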
DIFF PHASE

ads t           ads t-1



                          • Performs   the diff between executions

         Diff
                          • Merges   the ads of both executions


        ads t
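
A minimal in-memory sketch of the diff and merge between two executions, assuming ads are keyed by id. The `content` and `first_seen` fields and the `diff_and_merge` name are hypothetical; the real phase runs as a Hadoop job over much larger inputs.

```python
from typing import Dict, Tuple

Ad = Dict[str, str]


def diff_and_merge(
    current: Dict[str, Ad], previous: Dict[str, Ad]
) -> Tuple[Dict[str, Ad], Dict[str, str]]:
    """Compare execution t against t-1 and merge state carried from t-1."""
    merged: Dict[str, Ad] = {}
    status: Dict[str, str] = {}
    for ad_id, ad in current.items():
        old = previous.get(ad_id)
        if old is None:
            status[ad_id] = "new"
        elif old["content"] != ad["content"]:
            status[ad_id] = "changed"
        else:
            status[ad_id] = "unchanged"
        # Merge: keep the fresh content, but carry over first_seen from t-1.
        first_seen = old["first_seen"] if old else ad["first_seen"]
        merged[ad_id] = {**ad, "first_seen": first_seen}
    # Ads present at t-1 but missing at t are reported as deleted.
    for ad_id in previous.keys() - current.keys():
        status[ad_id] = "deleted"
    return merged, status
```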
MATCHING PHASE

ads              External Data
                                 • Extracts   semantic information:

                                   • Geographical    information

                                   • Car makes and models
      Matching
                                   • Companies

  enriched                         • ...
    ads
EXPIRATION PHASE

   ads
               • Works   as a filter

               • Deletes:

  Expiration
                 • Expired   ads

                 • Incorrect   ads
ads to be
 indexed
DEDUPLICATION PHASE

                   • Duplicates   are a big issue for us
     ads
                   • You cannot compare all N ads against
                    each other

                   • Solution:
   Deduplication
                     • Use heuristics to create “possible
                      duplicates” groups
deduplicated
    ads              • Compare     all the ads of each group
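
The grouping idea can be sketched like this. The heuristic (same city and price bucket) and the title similarity test are assumptions chosen for illustration; the slides do not say which heuristics or comparison are actually used.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(ad):
    # Hypothetical heuristic: same city and same price bucket of 100 units.
    return (ad["city"].lower(), ad["price"] // 100)


def similar(a, b, threshold=0.8):
    # Jaccard similarity on title tokens, a stand-in for the real comparison.
    ta = set(a["title"].lower().split())
    tb = set(b["title"].lower().split())
    union = ta | tb
    if not union:
        return False
    return len(ta & tb) / len(union) >= threshold


def deduplicate(ads):
    """Keep the first ad of every set of near-identical ads."""
    # Step 1: cheap heuristic builds the "possible duplicates" groups.
    groups = defaultdict(list)
    for ad in ads:
        groups[blocking_key(ad)].append(ad)
    # Step 2: the expensive pairwise comparison runs only inside each group.
    duplicates = set()
    for group in groups.values():
        for a, b in combinations(group, 2):
            if a["id"] in duplicates or b["id"] in duplicates:
                continue
            if similar(a, b):
                duplicates.add(b["id"])
    return [ad for ad in ads if ad["id"] not in duplicates]
```

The trade-off is that two duplicates landing in different groups are never compared, so the quality of the heuristic bounds the quality of the result.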
INDEXING PHASE

   ads           • Is actually done in two phases

                 • First   we create micro indexes

                   • We     use Embedded Solr Server
  Indexing
                 • Then    we merge them

                   • Plain   Lucene

Lucene Indexes
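
The two-phase idea (build many small indexes in parallel, then merge them) can be illustrated with a toy inverted index. This does not use the Solr or Lucene APIs; `build_micro_index` and `merge_indexes` are hypothetical names for the sketch.

```python
from collections import defaultdict


def build_micro_index(docs):
    """Phase 1: index one small partition. docs maps doc id -> text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(micro_indexes):
    """Phase 2: merge the posting lists of all micro indexes into one index."""
    merged = defaultdict(set)
    for index in micro_indexes:
        for term, ids in index.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}
```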
HIVE, PIG & CUSTOM TOOLS

            • Critical:

              • To know what is going on (control info)

              • To   debug

              • To   prototype new processes

              • To   understand your data
grep, cat
              • To   create reports
MIGRATION PROCESS


• Used Amazon    EC2 to test different cluster configurations

• Kept both systems running for one month

• Switched   to the new system gradually, one country at a time

• Then   we moved the cluster to our own servers
FROM BATCH
              TO NEAR REAL-TIME
• Batch   is not enough

• Storm     for real time data processing

• HBase     for data storage

• Zookeeper      for systems coordination

• Putting   it all together

• Batch   and NRT. Mixed architecture
BATCH IS NOT ENOUGH


• Data processing with MapReduce scales well but takes time
 and has high latency

• Crunching documents in batch means waiting until everything is processed

• We   want to show the user fresher results!
BATCH IS NOT ENOUGH

• Storm + HBase + Zookeeper looks like a good fit!

                                                           ZK
            MR pipeline

              HDFS                        Id tables
                                               Solr
          Topology

                                                      ZK

Feeds         Spouts      Bolts   Bolts                    Slaves
STORM - PROPERTIES

• Distributed   real time computation system

• Fault   tolerance

• Horizontal    scalability

• Low     latency

• Reliability
STORM - COMPONENTS

• Tuple

• Stream

• Spout

• Bolt

• Topology
STORM IN ACTION
         Spouts      Bolts     Bolts




                     Streams
                        of
                      tuples




Queue             Topology             DataStore
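
The flow in this diagram (queue, spout, bolts, data store) can be mimicked in a single process with Python generators. This is only a conceptual sketch and not the Storm API: there is no parallelism, acking or fault tolerance, and `feed_spout`, `parse_bolt`, `filter_bolt` and `run_topology` are hypothetical names.

```python
def feed_spout(lines):
    """Spout: reads from a source (here a list) and emits one tuple per line."""
    for line in lines:
        yield (line,)


def parse_bolt(stream):
    """Bolt: turns a raw line into an (ad_id, title) tuple."""
    for (line,) in stream:
        ad_id, title = line.split("\t", 1)
        yield (ad_id, title.strip())


def filter_bolt(stream):
    """Bolt: drops tuples with an empty title."""
    for ad_id, title in stream:
        if title:
            yield (ad_id, title)


def run_topology(spout, bolts):
    """Topology: wires the spout to a chain of bolts and drains the stream."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)
```

In real Storm each spout and bolt runs as many parallel tasks across the cluster, and the streams between them are partitioned by groupings rather than chained in one thread.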
STORM - DAEMONS


• Nimbus

• Supervisors

• Workers
HBASE - PROPERTIES

• Distributed, sorted    map datastore

• Automatic   failover

• Rows   are sorted

• Many   columns per row

• Good   Hadoop integration
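
The "sorted map" data model can be pictured with a small in-memory sketch: rows are kept sorted by key, each row holds many columns, and a range scan over adjacent keys is cheap. This is not the HBase client API; `SortedRowStore` and the `country#ad` key layout are assumptions for illustration.

```python
import bisect


class SortedRowStore:
    """Toy model of a sorted map: row key -> {column -> value}."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> dict of columns

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]
```

Because rows are sorted, choosing a key prefix such as the country keeps related rows next to each other, so one scan can read them all.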
HBASE - COMPONENTS


• Master

 • Slave   coordination and failure detection

 • Admin    features

• Region   server (slaves)
ZOOKEEPER


• Highly   available coordination system

• Used for locking, distributed configuration, leader election,
 cluster management...

• Curator makes common algorithms (recipes) easy to implement
PUTTING IT ALL TOGETHER
                                                                   ZK
        MR pipeline




          HDFS                                    Id tables
                                                                   Solr
        Topology



                                                              ZK

Feeds     Spouts      Bolts processor   Bolt Indexer               Slaves
MIXED ARCHITECTURE


• If the number of segments in the index gets too big, it has an
  impact on search performance

• Building indexes in batch keeps the number of segments small

• The mixed architecture gives near real-time updates and is
  tolerant to human error
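
A toy model of why the mix helps, assuming every near real-time update adds a small segment and every batch run replaces all segments with a single one. `MixedIndex` is a hypothetical class, not how Lucene or Solr manage segments internally.

```python
class MixedIndex:
    """Toy index made of segments (dicts of ad id -> ad), oldest first."""

    def __init__(self):
        self.segments = []

    def batch_rebuild(self, all_ads):
        # The batch pipeline reprocesses everything and yields one segment,
        # discarding whatever the NRT path wrote before (including mistakes).
        self.segments = [dict(all_ads)]

    def nrt_update(self, ad_id, ad):
        # Each near real-time update appends a new small segment.
        self.segments.append({ad_id: ad})

    def get(self, ad_id):
        # Newest segment wins; lookup cost grows with the number of segments.
        for segment in reversed(self.segments):
            if ad_id in segment:
                return segment[ad_id]
        return None
```

Between batch runs the index stays fresh through the small segments; the next batch run brings the segment count back to one and overwrites any bad update.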
THANK YOU!
  QUESTIONS?
