1. Big Data and NoSQL in REAL TIME
Facebook and Twitter Examples
Ron Zavner
2. Agenda
Our real time world…
Flavors of Big Data
Facebook messaging and real time analytics system
Twitter analytics system
Winning architecture?
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
3. What is Real Time?
4. We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
5. Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’Reilly
6. The Two Vs of Big Data
Velocity
Volume
7. The Flavors of Big Data Analytics
Counting Correlating Research
8. Analytics – Counting
How many signups, tweets, and retweets for a topic?
What’s the average latency?
Demographics
Countries and cities
Gender
Age groups
Device types
…
9. Analytics – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
10. Analytics – Research
Sentiment analysis
“Obama is popular”
Trends
“People like to tweet after watching American Idol”
Spam patterns
How can you tell when a user spams?
11. It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss
13. Store 135+ Billion Messages A Month
14. The actual analytics…
Like button analytics
Comments box analytics
15. Goals
Show why plugins are valuable
Make the data more actionable
Make the data more timely
Remove points of failure
Handle massive load - 200K events per second
16. Technology Evaluation
MySQL DB Counters
In-Memory Counters
MapReduce
Cassandra
HBase
18. Keep Things In Memory
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
20. Twitter Reach – Here’s One Use Case
21. Let’s start with some statistics…
Twitter in Numbers (March 2011)
Source: http://blog.twitter.com/2011/03/numbers.html
22. It takes a week for users to send 1 billion Tweets.
23. On average, 140 million tweets are sent every day.
24. The highest throughput to date is 6,939 tweets/sec.
25. 460,000 new accounts are created daily.
26. 5% of the users generate 75% of the content.
Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
27. Challenge – Word Count
Tweets → Count → Word:Count
• Hottest topics
• URL mentions
• etc.
28. Word Count – Analyze the Problem
(Tens of) thousands of tweets per second to process
Assumption: need to process in near real time
Aggregate counters for each word
A few tens of thousands of words (or hundreds of thousands if we include URLs)
System needs to scale linearly
System needs to be fault tolerant
29. Use EDA (Event Driven Architecture)
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter
30. Sharding (Partitioning)
Tokenizer 1 → Filterer 1 → Counter Updater 1
Tokenizer 2 → Filterer 2 → Counter Updater 2
Tokenizer 3 → Filterer 3 → Counter Updater 3
Tokenizer n → Filterer n → Counter Updater n
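A minimal sketch of this sharded pipeline (the shard count, stopword list, and stage names are illustrative): each stage is a function, and words are routed to counter shards by hash, so every occurrence of a word always lands on the same counter.

```python
from collections import Counter

NUM_SHARDS = 4
STOPWORDS = {"the", "a", "an", "and", "to", "rt"}

def tokenize(tweet):
    """Tokenizer stage: raw tweet -> lowercase tokens."""
    return tweet.lower().split()

def filter_tokens(tokens):
    """Filterer stage: drop stopwords and empty tokens."""
    return [t for t in tokens if t and t not in STOPWORDS]

# One counter per shard; in a real deployment each would be a
# separate process or machine fed by a partitioned event stream.
shards = [Counter() for _ in range(NUM_SHARDS)]

def shard_for(word):
    """Route every occurrence of a word to the same shard."""
    return hash(word) % NUM_SHARDS

def process(tweet):
    for word in filter_tokens(tokenize(tweet)):
        shards[shard_for(word)][word] += 1

for t in ["RT the storm is coming", "the storm hit twitter"]:
    process(t)

# Aggregating across shards gives the global counts.
total = sum(shards, Counter())
print(total["storm"])  # -> 2
```

Because routing is by word hash, no two shards ever count the same word, so the final aggregation is a simple merge with no duplicate resolution.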
31. Computing Reach with Event Streams
33. Twitter Storm
34. Storm Overview
35. Storm Cluster
36. Streaming word count with Storm
37. Storm Limitations
Storage
Data Persistency
Querying
Spouts
Bolts
Topologies
38. Winner is… Storm & In-Memory Data Grids
Event driven / flow
Reliable
Storage
Data Persistency
Querying
39. References
Facebook messages:
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
Facebook real-time analytics:
http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Learn and fork the code on GitHub:
https://github.com/Gigaspaces/rt-analytics
Detailed blog post:
http://bit.ly/gs-bigdata-analytics
Twitter in numbers:
http://blog.twitter.com/2011/03/numbers.html
Twitter Storm:
http://bit.ly/twitter-storm
Real time ideally means less than a second: not 30 seconds, not 5 seconds.
We live almost every aspect of our lives in a real-time world. Think about our social communications: we update our friends online via social networks and micro-blogging, we text from our mobiles, we message from our laptops. And it's not just our social lives: we shop online whenever we want, we search the web for immediate answers to our questions, we trade stocks online, we pay our bills, and we do our banking. All online, and all in real time. Real time doesn't just affect our personal lives, either. Enterprises and government agencies need real-time insights to be successful, whether they are investment firms that need fast access to market views and risk analysis, or retailers that need to adjust their online campaigns and recommendations. Even homeland security has come to rely increasingly on real-time monitoring. The amount of data that flows through these systems is huge. Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or the resources to build its own Twitter-like solution.
Big data will only continue to grow and expand: the amount of data is increasing, and so is the demand for it. Real-time analytics is fast becoming a requirement.
The Two Vs of Big Data are velocity and volume. As noted before, the volume of data we need to handle is huge, and at the same time we need to handle it fast. We are required to make very complex calculations in real time, and to perform them over a very large amount of data. The data is usually distributed among many servers; each server performs its own calculation and the results are then aggregated – map/reduce. This is a very common pattern for real-time analytics. Having said that, sometimes the latency requirement is more challenging still, and we need to improve the time these calculations take. You can't go straight to a relational DB – it isn't designed to handle the speed and volumes we're talking about – which is why we look at NoSQL or a cache. NoSQL can go further: without the constraints of a relational DB, we can store the data as-is (in JSON, the format used by Twitter). But processing the sheer amount of data in the timeframes we need is still incredibly challenging.
I think analytics – when we're talking about Big Data and something like Twitter – can be split into three categories, or buckets. The first bucket is "Counting": how many signups, tweets, or retweets are there for a topic? I might also be interested in counting in relation to demographic information – for example, how many people are tweeting right now at this event, and on what types of devices? The "Correlating" bucket might contain questions like: how many Twitter users are on desktop vs. mobile, and what's the trend within the week, or within the month? The third bucket, "Research", is similar to the second, but looks in more depth at the past – here we require a lot of processing of historical data.
Counting calculations – we expect to see results in real time. The challenge is reliability: it's not that we lose money directly, but if the accuracy of the system is damaged, the value of the report becomes meaningless. Counting requires very high resolution – every tweet counts, and we don't know in advance which one will be important. If we lose something, the accuracy of the system is damaged.
Correlating – we expect to see most results in real time too. These are the interactive queries where we expect a result we can lay out in a browser or a BI tool.
Research calculations are historical, and Hadoop (for example) is a very popular framework for this kind of batch analytics. We don't expect a real-time response here, but you never know what's next.
It's all about timing. We expect to see real-time results for many of our calculations. We also need to make sure our architecture is scalable: today we might need to handle 100K TPS, and that can easily grow to 200K TPS. We need to be highly available as well, ensuring zero downtime. For this we can use event-driven and stream-processing architectures. Correlation and research calculations are very interesting topics where we can accept longer response times; here, however, we are going to examine the real-time challenge.
We are going to talk about Facebook's real-time analytics system, and also about how they chose to store 135+ billion messages a month.
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
You may have read somewhere that Facebook has introduced a new Social Inbox integrating email, IM, SMS, text messages, and on-site Facebook messages. All in all, they need to store over 135 billion messages a month. Where do they store all that stuff? One of the posts gave the surprise answer: HBase beat out MySQL, Cassandra, and a few others.
Why a surprise? Facebook created Cassandra, and it was purpose-built for an inbox-type application, but they found Cassandra's eventual consistency model wasn't a good match for their new real-time Messages product. Facebook also has an extensive MySQL infrastructure, but they found performance suffered as data sets and indexes grew larger. And they could have built their own, but they chose HBase.
HBase is a scale-out table store supporting very high rates of row-level updates over massive amounts of data – exactly what a messaging system needs. HBase is also a column-based key-value store built on the BigTable model. It's good at fetching rows by key or scanning ranges of rows and filtering – also what a messaging system needs. Complex queries are not supported, however. Queries are generally handed over to an analytics tool like Hive, which Facebook created to make sense of their multi-petabyte data warehouse. Hive is based on Hadoop's file system, HDFS, which is also used by HBase.
Over the past year, social plugins have become an important and growing source of traffic for millions of websites. Today we're releasing a new version of Insights for Websites to give you better analytics on how people interact with your content and to help you optimize your website in real time.
Like button analytics: for the first time, you can now access real-time analytics to optimize Like buttons across both your site and Facebook. We use anonymized data to show you the number of times people saw Like buttons, clicked Like buttons, saw Like stories on Facebook, and clicked Like stories to visit your website.
Plugins are valuable: social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.
Make the data actionable: help users take action to make their content more valuable – how many people see a plugin, how many take action on it, and how many are converted to traffic back on your site.
Make the data more timely: they went from a 48-hour turnaround to 30 seconds. Multiple points of failure were removed to reach this goal.
http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
MySQL DB counters: have a row with a key and a counter. This results in lots of database activity. Stats are kept at a day-bucket granularity, so every day at midnight the stats roll over. When the roll-over period is reached, this produces a burst of writes to the database, which causes a lot of lock contention. They tried to spread the work by taking time zones into account, and tried to shard things differently. The high write rate led to lock contention; it was easy to overload the databases; they had to constantly monitor the databases and rethink their sharding strategy. The solution was not well tailored to the problem.
In-memory counters: if you are worried about IO bottlenecks, throw it all in memory. There are no scale issues: counters are stored in memory, so writes are fast and the counters are easy to shard. But they felt in-memory counters, for reasons not explained, weren't as accurate as other approaches – even a 1% failure rate would be unacceptable. Analytics drive money, so the counters have to be highly accurate. They didn't implement this system; it was a thought experiment, and the accuracy issue caused them to move on.
MapReduce: they used Hadoop/Hive for the previous solution. Flexible, easy to get running, and it can handle IO – both massive writes and reads. You don't have to know ahead of time how you will query: the data can be stored first and queried later. But it's not real-time, it has many dependencies and lots of points of failure, and it's a complicated system – not dependable enough to hit real-time goals.
Cassandra: HBase seemed a better solution based on availability and the write rate – and the write rate was the huge bottleneck being solved.
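The rejected MySQL design above can be sketched to show where the contention comes from (the key names and dates here are made up for illustration): one counter cell per (key, day) bucket means that at midnight every active key starts writing to a brand-new bucket at once.

```python
from collections import defaultdict
from datetime import datetime, timezone

# One counter cell per (key, day) bucket, mirroring the
# "row with a key and a counter" MySQL design described above.
counters = defaultdict(int)

def bump(key, when=None):
    when = when or datetime.now(timezone.utc)
    bucket = when.strftime("%Y-%m-%d")   # day-bucket granularity
    counters[(key, bucket)] += 1         # in MySQL: UPDATE ... SET n = n + 1

bump("like:example.com", datetime(2011, 3, 22, tzinfo=timezone.utc))
bump("like:example.com", datetime(2011, 3, 22, tzinfo=timezone.utc))
# Midnight passes: every key rolls to a new bucket simultaneously.
# In the database version this synchronized burst of new rows and
# hot UPDATEs is what caused the lock contention.
bump("like:example.com", datetime(2011, 3, 23, tzinfo=timezone.utc))
print(counters[("like:example.com", "2011-03-22")])  # -> 2
```

In a plain dict this is harmless; against a relational table, every `bump` on a hot key is a row lock, which is exactly the pattern the text says did not scale.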
http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
The winner: HBase + Scribe + Ptail + Puma. At a high level: HBase stores data across distributed machines. They use a tailing architecture: new events are stored in log files, and the logs are tailed. A system rolls the events up and writes them into storage, and a UI pulls the data out and displays it to users.
Data flow: a user clicks Like on a web page, which fires an AJAX request to Facebook. The request is written to a log file using Scribe, which handles issues like file roll-over. Scribe is built on the same HDFS file store Hadoop is built on. They write extremely lean log lines: the more compact the log lines, the more can be stored in memory.
Ptail: data is read from the log files using Ptail, an internal tool built to aggregate data from multiple Scribe stores. It tails the log files and pulls data out. Ptail data is separated into three streams so they can eventually be sent to their own clusters in different datacenters: plugin impressions, news feed impressions, and actions (plugin + news feed).
Puma: batches data to lessen the impact of hot keys. Even though HBase can handle a lot of writes per second, they still want to batch data. A hot article will generate a lot of plugin and news feed impressions, causing huge data skews and, in turn, IO issues. The more batching the better. They batch for 1.5 seconds on average; they would like to batch longer, but they have so many URLs that they run out of memory when building the hashtable. They wait for the last flush to complete before starting a new batch, to avoid lock contention issues.
UI renders data: the frontends are all written in PHP. The backend is written in Java, and Thrift is used as the messaging format so PHP programs can query the Java services. Caching solutions are used to make the web pages display more quickly. Performance varies by statistic: a counter can come back quickly, while finding the top URL in a domain can take longer – anywhere from 0.5 to a few seconds.
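The Puma-style batching idea can be sketched as follows (a simplified illustration, not Facebook's implementation; `storage` stands in for HBase): increments accumulate in memory for roughly the batch window, so a hot key hit thousands of times produces one batched storage write instead of thousands.

```python
import time
from collections import Counter

FLUSH_INTERVAL = 1.5  # seconds, matching the average batch window above

class BatchingCounter:
    """Accumulate increments in memory and flush them as one
    batched write, collapsing many hits on a hot key into a
    single storage update."""

    def __init__(self, storage):
        self.storage = storage          # stand-in for the HBase table
        self.pending = Counter()
        self.last_flush = time.monotonic()

    def increment(self, key, n=1):
        self.pending[key] += n
        if time.monotonic() - self.last_flush >= FLUSH_INTERVAL:
            self.flush()

    def flush(self):
        # One write per key per batch, however hot the key was.
        for key, delta in self.pending.items():
            self.storage[key] = self.storage.get(key, 0) + delta
        self.pending.clear()
        self.last_flush = time.monotonic()

storage = {}
c = BatchingCounter(storage)
for _ in range(1000):            # a hot URL generating many impressions
    c.increment("impression:hot-article")
c.flush()                        # final flush at shutdown
print(storage["impression:hot-article"])  # -> 1000
```

The trade-off is exactly the one described in the notes: a longer window means fewer writes but more memory held in the pending table.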
The more, and the longer, data is cached, the less real-time it is; they set different caching TTLs in memcache.
MapReduce: the data is then sent to MapReduce servers so it can be queried via Hive. This also serves as a backup plan, since the data can be recovered from Hive. Raw logs are removed after a period of time.
HBase is a distributed column store, a database interface to Hadoop. Facebook has people working internally on HBase. Unlike a relational database, you don't create mappings between tables and you don't create indexes: the only index you have is the primary row key. From the row key you can have millions of sparse columns of storage. It's very flexible: you don't have to specify a schema; you define column families, to which you can add keys at any time.
A key feature for scalability and reliability is the WAL (write-ahead log), a log of the operations that are supposed to occur. Based on the key, data is sharded to a region server and written to the WAL first. The data is then put into memory, and at some point in time, or once enough data has accumulated, it is flushed to disk. If the machine goes down, the data can be recreated from the WAL, so there is no permanent data loss. Using a combination of the log and in-memory storage, they can handle an extremely high rate of IO reliably. HBase handles failure detection and automatically routes around failures.
Currently, HBase resharding is done manually. Automatic hot-spot detection and resharding is on the roadmap for HBase, but it's not there yet: every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
Schema: store a bunch of counters on a per-URL basis. The row key, which is the only lookup key, is the MD5 hash of the reverse domain. Selecting the proper key structure helps with scanning and sharding. A problem they have is sharding data properly onto different machines; using an MD5 hash makes it easy to say this range goes here and that range goes there.
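The WAL mechanics described above can be shown in a toy sketch (illustrative only, nothing like HBase's actual implementation): append the operation to the log first, then update the in-memory table, and recover after a crash by replaying the log.

```python
class WalStore:
    """Toy write-ahead log: every put is appended to the log
    before it touches the in-memory table, so the table can be
    rebuilt by replaying the log after a crash."""

    def __init__(self):
        self.wal = []      # on a real region server this is a file on HDFS
        self.memtable = {}

    def put(self, key, value):
        self.wal.append((key, value))  # 1. durable log entry first
        self.memtable[key] = value     # 2. then the in-memory update

    def recover(self):
        """Rebuild in-memory state from the log, e.g. after a failure."""
        self.memtable = {}
        for key, value in self.wal:
            self.memtable[key] = value

store = WalStore()
store.put("com.example/page", 7)
store.memtable = {}   # simulate losing the in-memory state in a crash
store.recover()
print(store.memtable["com.example/page"])  # -> 7
```

Because the log is written before the memory update, there is no window in which an acknowledged write exists only in RAM, which is the property that makes the in-memory path safe at high IO rates.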
For URLs they do something similar, plus they add an ID on top of that: every URL in Facebook is represented by a unique ID, which is used to help with sharding. A reverse domain, com.facebook/ for example, is used so that the data is clustered together. HBase is really good at scanning clustered data, so if they store the data clustered together they can efficiently calculate stats across domains.
Think of every row as a URL and every cell as a counter; you can set different TTLs (time to live) for each cell. If you're keeping an hourly count, there's no reason to keep it around for every URL forever, so they set a TTL of two weeks. Typically, TTLs are set on a per-column-family basis. Per server they can handle 10,000 writes per second.
Checkpointing is used to prevent data loss when reading data from log files: tailers save log-stream checkpoints in HBase and replay from them on startup, so no data is lost. This is useful for detecting click fraud, though the system doesn't have fraud detection built in.
Tailer hot spots: in a distributed system there's a chance one part of the system is hotter than another. One example is region servers that run hot because more keys are being directed their way. One tailer can also lag behind another. If one tailer is an hour behind and the others are up to date, what numbers do you display in the UI? For example, impressions have a much higher volume than actions, so CTR rates were way higher in the last hour. The solution is to figure out the least up-to-date tailer and use that when querying metrics.
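A sketch of the row-key scheme described above – MD5 of the reverse domain as the prefix, plus the URL's unique ID (the exact layout Facebook uses isn't public, so the format here is an assumption): same-domain rows share a prefix, so they cluster on the same region and a prefix scan covers the whole domain.

```python
import hashlib

def row_key(url_id, url):
    """Build a row key in the spirit of the scheme above:
    MD5(reverse domain) prefix + the URL's unique ID
    (illustrative layout)."""
    host = url.split("/")[2]                     # e.g. "www.facebook.com"
    reverse_domain = ".".join(reversed(host.split(".")))  # com.facebook.www
    prefix = hashlib.md5(reverse_domain.encode()).hexdigest()
    return f"{prefix}:{url_id}"

k1 = row_key(101, "http://www.facebook.com/some-page")
k2 = row_key(102, "http://www.facebook.com/another-page")

# Same domain -> same prefix, so the rows sort next to each other
# and hash ranges are easy to assign to region servers.
print(k1.split(":")[0] == k2.split(":")[0])  # -> True
```

The hash also gives a uniform keyspace, which is what makes "this range goes here, that range goes there" sharding decisions straightforward.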
In Twitter, the primary relationship between entities is many-to-many: every post is sent to the numerous followers of the user who sent it, and at the same time each user can follow many other users. This causes Twitter to behave like a living organism, growing unexpectedly in many different directions.
Let me give you an example. One analytic where I need to process tweets is Twitter Reach: how many unique Twitter accounts received tweets about my topic. So how do I compute my reach? There are several stages in the processing:
1. First, I need to record every tweet.
2. Then I can count how many followers got that tweet.
3. Then I need to determine the distinct reach: for each follower, I look at each of their followers and remove the duplicates.
Try to imagine what it takes to produce that number. If my tweet is retweeted by 100 users, each of whom has 100 followers, it starts to take a fair bit of number crunching.
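The three stages can be sketched with sets (the follower graph here is made up): the set union in step 3 is what removes accounts that would otherwise be counted once per retweet.

```python
# Hypothetical follower graph: user -> the accounts following them.
followers = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"carol", "eve"},    # carol follows both alice and bob
    "carol": {"dave", "frank"},   # dave follows both alice and carol
}

def reach(original_poster, retweeters):
    """Unique accounts that received the tweet: the union of the
    followers of everyone who (re)tweeted it. The set union
    deduplicates accounts reached via multiple paths."""
    reached = set(followers.get(original_poster, set()))
    for user in retweeters:
        reached |= followers.get(user, set())
    return len(reached)

# alice tweets; bob and carol retweet. carol and dave each receive
# the tweet twice but are counted once.
print(reach("alice", ["bob", "carol"]))  # -> 5
```

At Twitter scale the sets are far too large for one machine, which is why the notes describe doing this with distributed event streams rather than a single union.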
Read-mostly: duplicate the data so you can optimize for reads.
Let's analyze the problems that a simple Twitter word count presents. The challenge here seems straightforward: tens of thousands of tweets need to be stored and parsed every second, and word counters need to be aggregated continuously. Even though tweets are limited to 140 characters, we are dealing with hundreds of thousands of words per second. This is big.
In many ways this is the benchmark for other systems, because it stretches the limits. There is a huge amount of activity to analyze – the scale is enormous – and we want to extract a lot of information from it. That is the challenge:
> How do we grab the stream in real time without affecting latency?
> How do we deal with that stream in real time?
> How do we handle write scalability in real time?
> How do we make the system bullet-proof and easily scalable?
> How do we begin to do analytics on this?
Storm is a real-time, open-source data-streaming framework that operates entirely in memory. Storm is designed to run on several machines to provide parallelism. Real-time processing is becoming very popular, and Storm is a popular open-source framework and runtime used by Twitter for processing real-time data streams. Storm addresses the complexity of running real-time streams across a compute cluster by providing an elegant set of abstractions that make it easier to reason about your problem domain, letting you focus on data flows rather than implementation details.
Storm constructs a processing graph that feeds data from an input source through processing nodes. The processing graph is called a "topology". The input data sources are called "spouts", and the processing nodes are called "bolts". The data model consists of tuples: tuples flow from spouts to bolts, which execute user code. Besides being places where data is transformed or accumulated, bolts can also join and branch streams. Storm topologies are deployed somewhat like a webapp: a jar file is presented to a deployer, which distributes it around the cluster, where it is loaded and executed. A topology runs until it is killed.
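The spout/bolt/topology idea can be mirrored in a short Python sketch (purely conceptual: Storm itself runs on the JVM, and this is not its API – just generators standing in for tuple streams):

```python
def tweet_spout():
    """Spout: the source of tuples for the topology."""
    for tweet in ["storm is fast", "storm scales"]:
        yield (tweet,)

def split_bolt(stream):
    """Bolt: emit one (word,) tuple per word in each tweet tuple."""
    for (tweet,) in stream:
        for word in tweet.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: accumulate a running count per word."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# The "topology": wire spout -> bolt -> bolt. In Storm, each node
# would run as many parallel tasks spread across the cluster, with
# the framework routing tuples between them.
counts = count_bolt(split_bolt(tweet_spout()))
print(counts["storm"])  # -> 2
```

The point of the abstraction is visible even in the toy version: each stage only knows about tuples in and tuples out, so the wiring (and its parallelism) can be changed without touching stage logic.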
ZooKeeper – Storm uses ZooKeeper to communicate between the "Nimbus" (master) and the "Supervisors" (workers), as well as to store its current state. ZooKeeper coordinates activity in the cluster and provides operational state storage.
storm-nimbus – the topology execution coordinator for the cluster. The Nimbus is a singleton in the cluster (i.e. not elastic). It is stateless, however (since state is stored in ZooKeeper), and can therefore fail and be restarted without consequence, even to running jobs.
storm-supervisor – the supervisors actually run the topology code. There can and should be many of these (i.e. elastic). The parallelism attributes of a given topology are specified in the topology itself.
Data grids are more event-driven, while Storm is built for flow/streaming. Storm is very specifically directed at the streaming problem and is optimized for that use case. In order to produce extremely high throughput, it pushes responsibility for reliability outside its own framework. Also, because of its streaming focus, it provides higher-level abstractions that make reasoning about streaming easier than in XAP.
Reliable – XAP's architecture is oriented toward making data in memory nearly as reliable as data on disk. Thus, writing into XAP involves some level of serialization, and perhaps a network hop as well. Storm doesn't aspire to this level of reliability; instead it provides the means for the suppliers and consumers of data to provide it. Storm is "optimistic" in roughly the same sense that an optimistic lock in a database is optimistic: it assumes success is far more likely than failure, and so is willing to take big hits to performance when failures occur, because they are rare. XAP is more pessimistic in this sense: XAP is designed to be a source of truth for the data it holds, and goes to great lengths to achieve that.
For the reasons cited above, there is no way, even in principle, for XAP to match Storm's throughput – at least when there is no persistence. This caveat is critical, however, since real-world systems almost always need persistence, and ultra-fast in-memory persistence is one of XAP's main strengths. I also mentioned that Storm has higher-level abstractions for streaming, which make programming streaming applications more straightforward. Whereas in XAP you could implement streaming as a series of event-driven processing stages, there is no concept of a "stream" or any kind of "flow" at the API level.
Storm with XAP – basically, spouts provide the source of tuples for Storm processing.
For spouts to be maximally performant and reliable, they need to provide tuples in batches and be able to replay failed batches when necessary. Of course, in order to have batches you need storage, and to be able to replay batches you need reliable storage. XAP is about the highest-performing, most reliable source of data out there, so a spout that serves tuples from XAP is a natural combination. Recall that Storm is a stream-processing framework and runtime, which presupposes the existence of a stream for it to read from. So there are really two artifacts needed for XAP to provide a spout to Storm: a "stream" in XAP, and of course the spout that reads from it. Realizing this, I wrote a simple service for XAP that leverages XAP's FIFO capabilities, called XAPStream. It is a standalone (Storm-independent) service that lets clients dynamically create, destroy, and of course read and write from streams, in both batch and non-batch modes.
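The two properties a reliable spout source needs – batched reads and replay of failed batches – can be sketched like this (an illustrative toy, not the XAPStream API; in XAP the ordered storage would be the durable FIFO space):

```python
class ReplayableStream:
    """Toy FIFO stream that serves tuples in batches and can
    replay a failed batch: reads don't advance the cursor until
    the consumer acknowledges the batch."""

    def __init__(self):
        self.entries = []   # durable, ordered storage in the real thing
        self.cursor = 0     # start of the current in-flight batch

    def write(self, item):
        self.entries.append(item)

    def read_batch(self, size):
        """Return the next batch without advancing past it."""
        return self.entries[self.cursor:self.cursor + size]

    def ack(self, size):
        """Batch processed successfully: advance the cursor."""
        self.cursor += size

stream = ReplayableStream()
for t in ["t1", "t2", "t3"]:
    stream.write(t)

batch = stream.read_batch(2)          # -> ["t1", "t2"]
# Simulate a processing failure: no ack, so the same batch
# is served again on the next read (the "replay").
assert stream.read_batch(2) == batch
stream.ack(2)                         # success: move past the batch
print(stream.read_batch(2))           # -> ["t3"]
```

Separating read from ack is the whole trick: replay is free, because failure simply means never advancing the cursor.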