Real Time Analytics for Big Data
A Twitter Inspired Case Study



                                   @natishalom
Big Data Predictions




         ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data

         Velocity                                                   Volume




We’re Living in a Real Time World…
        Social                  User Tracking & Engagement               Homeland Security




      eCommerce                       Financial Services                 Real Time Search




The Flavors of Big Data Analytics




       Counting                                Correlating               Research




Analytics @ Twitter – Counting

     • How many signups, tweets, retweets for a topic?
     • What’s the average latency?
     • Demographics
          • Countries and cities
          • Gender
          • Age groups
          • Device types
          • …

Analytics @ Twitter – Correlating

     • What devices fail at the same time?
     • What features get users hooked?
     • What places on the globe are “happening”?

Analytics @ Twitter – Research

     • Sentiment analysis
          • “Obama is popular”
     • Trends
          • “People like to tweet after watching American Idol”
     • Spam patterns
          • How can you tell when a user spams?

It’s All about Timing




       “Real time”                      Reasonably quick                     Batch
     (< a few seconds)                 (seconds to minutes)               (hours/days)




It’s All about Timing
               • Event driven / stream processing
               • High resolution – every tweet gets counted

               • Ad-hoc querying
               • Medium resolution (aggregations)

               • Long running batch jobs (ETL, map/reduce)
               • Low resolution (trends & patterns)

               This is what we’re here to discuss

Challenge – Word Count
        [Diagram: Tweets -> ? -> Count -> Word:Count]

          • Hottest topics
          • URL mentions
          • etc.

URL Mentions – Here’s One Use Case




Twitter in Numbers (March 2011)



     It takes a week for users to send 1 billion Tweets.
     Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers (March 2011)



     On average, 140 million tweets get sent every day.
     Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers (March 2011)



     The highest throughput to date is 6,939 tweets/sec.
     Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers (March 2011)



     460,000 new accounts are created daily.
     Source: http://blog.twitter.com/2011/03/numbers.html

Twitter in Numbers




     5% of the users generate
      75% of the content.
                                                            Source: http://www.sysomos.com/insidetwitter/

Analyze the Problem

  • (Tens of) thousands of tweets per second to process
      • Assumption: need to process in near real time
  • Aggregate counters for each word
      • A few tens of thousands of words (or hundreds of thousands if we include URLs)
  • System needs to scale linearly
  • System needs to be fault tolerant
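
A rough back-of-the-envelope check of the load can be made from the figures quoted on the "Twitter in Numbers" slides (140 million tweets per day, peak of 6,939 tweets/sec). The words-per-tweet value below is an assumption for illustration only and does not come from the deck.

```python
# Figures quoted on the "Twitter in Numbers (March 2011)" slides.
TWEETS_PER_DAY = 140_000_000
PEAK_TWEETS_PER_SEC = 6_939

# Assumption for illustration only (not from the deck): average words per tweet.
ASSUMED_WORDS_PER_TWEET = 10

SECONDS_PER_DAY = 24 * 60 * 60

# Average sustained rate implied by the daily volume.
avg_tweets_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY

# How far the quoted peak sits above that average.
peak_to_avg = PEAK_TWEETS_PER_SEC / avg_tweets_per_sec

# Counter updates per second at peak if every word increments one counter.
peak_counter_updates_per_sec = PEAK_TWEETS_PER_SEC * ASSUMED_WORDS_PER_TWEET

print(f"average: {avg_tweets_per_sec:,.0f} tweets/sec")
print(f"peak is {peak_to_avg:.1f}x the average")
print(f"peak counter updates: {peak_counter_updates_per_sec:,} per sec")
```

Under these numbers the average is roughly 1,620 tweets/sec and the quoted peak is about 4.3 times that, so a design sized only for the average would fall well short during bursts.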


Key Elements in Real Time Big Data Analytics




Sharding (Partitioning)

           Tokenizer 1   ->   Filterer 1   ->   Counter Updater 1
           Tokenizer 2   ->   Filterer 2   ->   Counter Updater 2
           Tokenizer 3   ->   Filterer 3   ->   Counter Updater 3
               ...
           Tokenizer n   ->   Filterer n   ->   Counter Updater n

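
The partitioned tokenizer, filterer and counter-updater layout could be sketched roughly as below. This is a minimal single-process sketch, not the deck's implementation: the partition count, the stop-word list, the routing rule (a stable CRC32 hash of the word) and all function and class names are assumptions made for illustration. Tokenizing and filtering are treated as stateless, so only the counter update is routed to a specific partition.

```python
import re
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical; the slide only says "n"
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}  # illustrative


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Route a key to a partition using a stable hash (CRC32), so the
    same word always lands on the same partition, even across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def tokenize(tweet):
    """Tokenizer stage: lowercase the tweet and split it into tokens."""
    return re.findall(r"[a-z0-9#@']+", tweet.lower())


def keep(token):
    """Filterer stage: drop stop words and single-character tokens."""
    return len(token) > 1 and token not in STOP_WORDS


class Partition:
    """One shard: owns the in-memory counters for the words routed to it."""

    def __init__(self):
        self.counts = Counter()

    def update(self, word):
        """Counter-updater stage: increment this shard's counter."""
        self.counts[word] += 1


def process(tweets, partitions):
    """Run every tweet through tokenize -> filter -> routed counter update."""
    for tweet in tweets:
        for token in filter(keep, tokenize(tweet)):
            index = partition_for(token, len(partitions))
            partitions[index].update(token)


def total_counts(partitions):
    """Merge per-partition counters into one global view (for reporting)."""
    merged = Counter()
    for partition in partitions:
        merged.update(partition.counts)
    return merged


if __name__ == "__main__":
    shards = [Partition() for _ in range(NUM_PARTITIONS)]
    process(["The cat sat", "the CAT ran to the hat"], shards)
    print(total_counts(shards).most_common(3))
```

Because each word maps to exactly one partition, counters never need cross-partition coordination on the write path, which is what lets the design scale by adding partitions. A stable hash is used rather than Python's built-in `hash()`, whose string results vary between processes.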
Keep Things In Memory

   • Facebook keeps 80% of its data in memory (Stanford research)
   • RAM is 100-1000x faster than disk (random seek)
        • Disk: 5-10 ms
        • RAM: ~0.001 ms

Use EDA (Event Driven Architecture)




Putting it all together




Know Your Toolset




References
  • Writing your own Twitter analytics: http://ht.ly/d8j4I
  • Detailed blog post: http://bit.ly/gs-bigdata-analytics
  • Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
  • Twitter Storm: http://bit.ly/twitter-storm
  • Apache S4: http://incubator.apache.org/s4/



Editor's Notes

  1. ActiveInsight