Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

•Download as PPTX, PDF•

1 like•787 views

0.) With each device comes an implicit contract with the end user: you give us the data, we give you the results. Now. Not tomorrow. Not even fifteen minutes from now. 1.) The flip side of getting data in real time is that users expect results in real time. 2.) In return for being wired into the internet 24x7 customers demand a similar level of responsiveness and even better availability. __ Spark Streaming and Cassandra form the ideal combination of high velocity CEP and analytics with a high velocity and always on database. Today’s solutions don’t scale to the Internet of tomorrow. The always-on nature of the emerging Internet of Things space means you need to process information at previously unseen scale and, more difficult, make sense out of that data. Cassandra is the leader in large scale, high velocity, time series data workloads. While the Hadoop world has been stuck with legacy “batch analytics” technology, Cassandra users have been increasingly focused on the “now”. Fast answers to easy questions about your data, at any velocity, and any scale. But Cassandra has always been weak on the “complex questions” problem. DataStax integrated with Hadoop to overcome this limitation, but it was always an awkward fit. Slow batch analytics on top of fast moving data really doesn’t do you much good. But Spark, and in this case, Spark Streaming, make high velocity streaming analytics at scale easier than ever, similar to how Cassandra pioneered high-velocity data management at scale. Hadoop is the right choice for batch analytics. Until recently, nobody really knew what the right solution is for real-time processing. We believe that Spark and Cassandra are the clear answer.

Technology

A Tale of Two
Technologies
A Story of the IoT “revolution”

It was the age of disconnectedness
Failure tolerance
is not optional

It was the epoch of complexity
Did you just tell me to go
#@$#@$ myself?

It was the epoch of simplicity
We brought as much simplicity as
we could to the menagerie known
as Hadoop, but it’s still slow.

What is the Internet of Things?
“The Internet of Things (IoT) refers to uniquely identifiable objects and their
virtual representations in an Internet-like structure.”
-Wikipedia

No really what is IoT?
● It’s literally the act of connecting “things” to
the Internet
● It predates the World Wide Web
● It shouldn’t be surprising to anybody

So IoT is just hype?
Home security and automation
Fitness trackers
Your (driverless?) car
Medical devices
Sensors
Industrial equipment
NO!

What is Cassandra?
● A massively scalable distributed database
● Chooses availability over strong consistency
(yes, that really is a fundamental tradeoff)
● With its wide partitions it is able to take
advantage of data locality even at “web
scale”
Chocolate!

What is Spark?
● DAG is a logical superset of M/R
● Adopts much of the Hadoop ecosystem,
without being bound by it
● Intelligent use of caching (RDDs) for
massive performance gains
● Incorporate Streaming to make ingestion-
time processing a first class citizen
Peanut butter!

What does IoT need from big data?
● Log time-series events -- at scale
● Gather meaning from that data -- at scale
● Report on that data -- at scale
● Take action on that data -- at scale

Reporting at scale
● Canned reports
● Ad-hoc querying and reporting
● Drill down / exploratory
● Alerting
● Aggregation
● Clustering (K-means, et al)
● Generalized machine learning

Take action at scale
● Stateless application servers
● Horizontally scaled and co-located with
Cassandra and Spark in each DC
● Any platform with a CQL driver

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer

08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls

[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j

Slack Application Development 101 Slidespraypatel2

A Domino Admins Adventures (Engage 2024)Gabriella Davis

Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun

CNv6 Instructor Chapter 6 Quality of Servicegiselly40

Scaling API-first – The story of a global engineering organizationRadu Cotescu

Artificial Intelligence: Facts and MythsJoaquim Jorge

Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun

The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad

Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko

08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls

Presentation on how to chat with PDF using ChatGPT code interpreternaman860154

Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge

Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC

Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer

Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men

[2024]Digital Global Overview Report 2024 Meltwater.pdf

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...

Slack Application Development 101 Slides

A Domino Admins Adventures (Engage 2024)

Powerful Google developer tools for immediate impact! (2023-24 C)

CNv6 Instructor Chapter 6 Quality of Service

Scaling API-first – The story of a global engineering organization

Artificial Intelligence: Facts and Myths

Data Cloud, More than a CDP by Matt Robison

The Codex of Business Writing Software for Real-World Solutions 2.pptx

Handwritten Text Recognition for manuscripts and early printed texts

08448380779 Call Girls In Civil Lines Women Seeking Men

Presentation on how to chat with PDF using ChatGPT code interpreter

Boost Fertility New Invention Ups Success Rates.pdf

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf

Breaking the Kubernetes Kill Chain: Host Path Mount

Axa Assurance Maroc - Insurer Innovation Award 2024

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf

Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

1. A Tale of Two Technologies A Story of the IoT “revolution”

2. It was the age of connectedness

3. It was the age of disconnectedness Failure tolerance is not optional

4. It was the age of wisdom

5. It was the age of foolishness

6. It was the epoch of complexity Did you just tell me to go #@$#@$ myself?

7. It was the epoch of simplicity We brought as much simplicity as we could to the menagerie known as Hadoop, but it’s still slow.

8. What is the Internet of Things? “The Internet of Things (IoT) refers to uniquely identifiable objects and their virtual representations in an Internet-like structure.” -Wikipedia

9. No really what is IoT? ● It’s literally the act of connecting “things” to the Internet ● It predates the World Wide Web ● It shouldn’t be surprising to anybody

10. So IoT is old news? Most definitely

11. So IoT is just hype? Home security and automation Fitness trackers Your (driverless?) car Medical devices Sensors Industrial equipment NO!

12. What is Cassandra? ● A massively scalable distributed database ● Chooses availability over strong consistency (yes, that really is a fundamental tradeoff) ● With its wide partitions it is able to take advantage of data locality even at “web scale” Chocolate!

13. What is Spark? ● DAG is a logical superset of M/R ● Adopts much of the Hadoop ecosystem, without being bound by it ● Intelligent use of caching (RDDs) for massive performance gains ● Incorporate Streaming to make ingestion- time processing a first class citizen Peanut butter!

14. What does IoT need from big data? ● Log time-series events -- at scale ● Gather meaning from that data -- at scale ● Report on that data -- at scale ● Take action on that data -- at scale

15. Logging events at scale

16. Gathering meaning at scale

17. Reporting at scale ● Canned reports ● Ad-hoc querying and reporting ● Drill down / exploratory ● Alerting ● Aggregation ● Clustering (K-means, et al) ● Generalized machine learning

18. Take action at scale ● Stateless application servers ● Horizontally scaled and co-located with Cassandra and Spark in each DC ● Any platform with a CQL driver

19. The architecture...

20. Spark Streaming to Cassandra

21. Cassandra to Spark Streaming

22. Multi-DC DC1 DC2 Write anywhere.

23. Things that go together

24. Things that go together

25. A Tale of Two Summits

Editor's Notes

Rant about strong consistency
1990: John Romkey created the first Internet ‘device’, a toaster that could be turned on and off over the Internet. 1991: The first web page was created by Tim Berners-Lee
Cassandra powers devices in all of these spaces today
We can scale to whatever you need.
Spark streaming is the real time data enrichment layer. Data augmentation, anomaly detection, denormalization, etc.
Even without augmentation, individual device timelines are stored in wide partitions, hence fast to use Spark to
Application logic/servers co-located with each Spark/Cassandra DC. Output of any Spark job can be a message to the application servers to take action.
So what does the deployment look like?
Talk about message queues, and methods of getting data to spark
Talk about the Cassandra API, triggers, and the path to write back to Cassandra
Write locally, interact globally. Replication factor per keyspace per dc.
Look for an announcement at the Spark Summit on Monday the 30th. Much more to come at the Cassandra Summit in September.

Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Recommended

Recommended

More Related Content

More from DataStax Academy

More from DataStax Academy (20)

Recently uploaded

Recently uploaded (20)

Cassandra Day NY 2014: Using Spark Streaming for High Velocity Analytics on Cassandra

Editor's Notes