The Big Data industry emerged in response to the unprecedented sizes of the data sets collected by Internet companies and their particular needs for storing and using that data.
Today, the need to process that data more quickly is morphing Big Data architectures into Fast Data architectures. This session discusses the forces driving this trend and the most popular tools that have emerged to address particular design challenges:
Spark - For sophisticated processing of data streams, as well as traditional batch-mode processing.
Kafka - For durable and scalable ingestion and distribution of data streams.
Cassandra - For scalable, flexible persistence.
Reactive Platform: Lagom, Akka, and Play - For integration of other components and building microservices.
Mesos - For cluster resource management.
---
About the presenter:
Dean Wampler, Ph.D. is the Architect for Big Data Products and Services and a member of the office of the CTO at Lightbend. He is designing the product strategy and technical architecture for Lightbend's Spark on Mesos products and emerging streaming tools built around Spark and Lightbend’s ConductR and Akka products. Dean has written books on Scala, Functional Programming, and Hive for O'Reilly. He speaks at and co-organizes many industry conferences. He also organizes several Chicago-area user groups and contributes to many open-source projects, including Apache Spark. Dean has a Ph.D. in Physics from the University of Washington.
11. Hadoop Strengths
• Lowest CapEx system for Big Data.
• Excellent for ingesting and integrating diverse datasets.
• Flexible: from classic analytics (aggregations and data warehousing) to machine learning.
21. SQL queries and a “DataFrame” DSL
[Diagram: Spark components. Spark Streaming (~Real Time), MLlib (Machine Learning), SQL/DataFrames (Structured Data), and GraphX (Graphs), all built on Spark RDD (Core).]
• For data with a fixed schema...
• Write SQL queries (currently a subset of HiveQL).
• Use the equivalent, Python-inspired DataFrame API.
22. Use SQL or the Idiomatic DataFrame API
// SQL:
sqlContext.sql("""
SELECT state, age, COUNT(*) AS cnt
FROM people
GROUP BY state, age
ORDER BY cnt DESC, state ASC, age ASC
""")
// DataFrame (Scala):
people.select($"state", $"age")
.groupBy($"state", $"age").count()
.orderBy($"count".desc, $"state".asc, $"age".asc)
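
To make the semantics of the two forms above concrete, here is a minimal sketch of the same computation over ordinary Scala collections, with no Spark involved. The `Person` case class, its `name` field, and the `countByStateAndAge` helper are hypothetical names introduced only for illustration; they are not part of the talk or of the Spark API.

```scala
// Plain-Scala sketch (no Spark) of what the query above computes.
// Person and countByStateAndAge are hypothetical, illustrative names.
object StateAgeCounts {
  final case class Person(name: String, state: String, age: Int)

  // Mirrors: GROUP BY state, age / COUNT(*) AS cnt /
  //          ORDER BY cnt DESC, state ASC, age ASC
  // Each result tuple is (state, age, cnt).
  def countByStateAndAge(people: Seq[Person]): Seq[(String, Int, Int)] =
    people
      .groupBy(p => (p.state, p.age))                     // one group per (state, age)
      .toSeq
      .map { case ((state, age), group) => (state, age, group.size) }
      .sortBy { case (state, age, cnt) => (-cnt, state, age) } // negate cnt for DESC
}
```

Calling `StateAgeCounts.countByStateAndAge` on a small in-memory `Seq[Person]` returns the rows in the same order the SQL `ORDER BY` clause would produce. Spark performs the equivalent grouping and sorting in a distributed fashion.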
27. Fast as in Streaming. Why?
• Update a search engine in real time as web pages or documents change.
• Train a SPAM filter with every email.
• Detect anomalies as they happen through processing of logs and monitoring data.
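
The anomaly-detection use case on this slide can be illustrated with a small per-record sketch in plain Scala. It is not the speaker's implementation and uses no Spark or Kafka APIs. It assumes a simple rule, chosen here only for illustration: flag a value when it lies more than `k` standard deviations from the running mean of the values seen so far, with the mean and variance maintained incrementally by Welford's online algorithm. The names `Stats`, `detect`, `k`, and `warmup` are hypothetical.

```scala
// Sketch of per-record ("streaming") anomaly detection.
// Stats, detect, k, and warmup are hypothetical, illustrative names.
object StreamingAnomalies {
  // Running statistics after `count` observations (Welford's algorithm).
  final case class Stats(count: Long, mean: Double, m2: Double) {
    // Sample variance; defined as 0 until there are two observations.
    def variance: Double = if (count > 1) m2 / (count - 1) else 0.0
    def stdDev: Double = math.sqrt(variance)

    // Fold one new observation into the running statistics in O(1).
    def update(x: Double): Stats = {
      val n = count + 1
      val delta = x - mean
      val newMean = mean + delta / n
      Stats(n, newMean, m2 + delta * (x - newMean))
    }
  }

  val empty: Stats = Stats(0L, 0.0, 0.0)

  // Lazily yields the values judged anomalous against the statistics of
  // all EARLIER values. Nothing is flagged until `warmup` values have
  // been seen. Note that flagged values still update the statistics.
  def detect(values: Iterator[Double], k: Double = 3.0, warmup: Int = 5): Iterator[Double] = {
    var stats = empty
    values.filter { x =>
      val isAnomaly =
        stats.count >= warmup &&
          stats.stdDev > 0 &&
          math.abs(x - stats.mean) > k * stats.stdDev
      stats = stats.update(x)
      isAnomaly
    }
  }
}
```

Because each record is examined once and the state is a fixed-size `Stats` value, the same logic could in principle be applied to an unbounded stream. A production system would also need to decide whether flagged values should be excluded from the running statistics.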
40. Next Steps
• Learn - Fast Data: Big Data Evolved
• Watch - Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
• Review - Spark success stories by Lightbend clients