Real-Time Analytics with Spark and Cassandra

WidasConcepts Unternehmensberatung GmbH · Maybachstraße 2 · 71299 Wimsheim · http://www.widas.de
March 2015
OSBI Workshop
http://www.osbi-workshop.de/

© WidasConcepts
Real-Time Analytics?

Real-Time Analytics with Spark and Cassandra

Cassandra
Advantages of Cassandra: a massively scalable distributed database
CAP trade-off is freely tunable; for analytics: AP
Shared nothing, peer to peer
KKV (key-key-value): wide-column storage with wide partitions
Data models optimized for time series
In-memory tables
Data locality through wide partitions
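The KKV idea and the time-series-friendly wide partition can be pictured with a small in-memory model. This is a plain-Python sketch, not Cassandra code: the class name `WidePartitionTable`, the (sensor, day) partition key and the seconds-of-day clustering key are assumptions chosen for illustration.

```python
import bisect
from collections import defaultdict


class WidePartitionTable:
    """Toy key-key-value store: partition key -> sorted clustering keys -> value."""

    def __init__(self):
        # Each partition keeps its clustering keys sorted, like rows in a wide partition.
        self._keys = defaultdict(list)
        self._values = defaultdict(dict)

    def insert(self, partition_key, clustering_key, value):
        if clustering_key not in self._values[partition_key]:
            bisect.insort(self._keys[partition_key], clustering_key)
        self._values[partition_key][clustering_key] = value

    def range_query(self, partition_key, start, end):
        """Return (clustering_key, value) pairs with start <= key < end from one partition."""
        keys = self._keys.get(partition_key, [])
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        values = self._values.get(partition_key, {})
        return [(k, values[k]) for k in keys[lo:hi]]


table = WidePartitionTable()
table.insert(("sensor-1", "2015-03-01"), 3600, 21.5)
table.insert(("sensor-1", "2015-03-01"), 60, 20.1)
table.insert(("sensor-1", "2015-03-01"), 1800, 20.9)
table.insert(("sensor-2", "2015-03-01"), 60, 18.0)

# A time-window read touches exactly one partition and returns rows in time order.
window = table.range_query(("sensor-1", "2015-03-01"), 0, 3600)
print(window)  # [(60, 20.1), (1800, 20.9)]
```

Because all readings of one sensor and day live in one sorted partition, a time-range read is a contiguous slice of a single partition, which is the locality argument made on the slide.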

Cassandra – Ring Structure
Every node in Cassandra is equivalent and can be addressed by clients
Configurable replication (local, per data center)
[Figure: ring of five nodes (1 to 5) with a client connected to the ring]
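How a ring assigns data to nodes and replicas can be sketched with a toy token ring. This is a simplified model under stated assumptions: one token per node, an MD5-based stand-in hash (Cassandra's default partitioner uses Murmur3), and hypothetical names `token`, `Ring` and `replicas`.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a ring position (toy hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % 2**32


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Each node owns one token; the sorted tokens form the ring.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str):
        """Walk clockwise from the key's token and pick the next distinct nodes."""
        start = bisect.bisect_left(self._tokens, token(partition_key))
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
owners = ring.replicas("sensor-1")
print(owners)  # three distinct nodes, starting at the key's owner
```

Since every node can compute the same mapping, a client may contact any node, and that node can forward the request to the replicas.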

Availability
"Can't fail, must scale" system
Data replication ensures availability
Node failures are handled automatically
[Figure: ring of five nodes (1 to 5) with a client connected to the ring]
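The link between replication and availability can be made concrete with a little arithmetic. The sketch below assumes a replication factor of 3 and uses hypothetical helper names; it only encodes the well-known rules that a quorum is a majority of replicas and that reads see the latest write when read and write replica sets overlap (R + W > N).

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int, required_replicas: int) -> int:
    """Replica nodes that may be down while a request can still be served."""
    return replication_factor - required_replicas


def reads_see_latest_write(replication_factor: int, write_replicas: int, read_replicas: int) -> bool:
    """Read and write replica sets are guaranteed to overlap when R + W > N."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # assumed replication factor
print(quorum(rf))                          # 2
print(tolerated_failures(rf, 1))           # 2 (one replica must answer)
print(tolerated_failures(rf, quorum(rf)))  # 1 (a quorum must answer)
print(reads_see_latest_write(rf, 1, 1))    # False: favors availability
print(reads_see_latest_write(rf, 2, 2))    # True: quorum reads and writes overlap
```

Requiring fewer replicas per request tolerates more node failures at the cost of consistency, which is the tunable trade-off behind the AP setting for analytics.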

Cassandra Scalability Compared
Source: Planet Cassandra
[Figure: number of operations (read/write) per second versus number of nodes]

Spark
Core elements
Processing as a DAG (directed acyclic graph)
Resilient Distributed Datasets (RDDs)
Scala
Local JVM processes on the nodes
Parallel transformations and actions on RDDs
Transformations: map, filter, groupBy …
Actions: count, collect, save …

Spark - RDD
immutable
partitioned
logical collection of records
rebuildable
materialized in memory
cached for future reuse

Spark – RDD – Transformations and Actions (parallel)
Transformations
map
filter
groupByKey
join
…
Actions
reduce
collect
count
lookup
…
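The split between transformations and actions can be illustrated with a toy class in plain Python. This is not the Spark API: `ToyRDD` is a hypothetical single-process stand-in that only shows the pattern that transformations record work lazily and actions trigger the computation.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD: transformations only record work, actions run it."""

    def __init__(self, source, steps=()):
        self._source = source  # a list standing in for the input data
        self._steps = steps    # recorded transformations, not yet executed

    # Transformations: return a new ToyRDD, nothing is computed yet.
    def map(self, func):
        return ToyRDD(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _compute(self):
        data = iter(self._source)
        for kind, func in self._steps:
            # Inside a method, map and filter refer to the Python built-ins.
            data = map(func, data) if kind == "map" else filter(func, data)
        return data

    # Actions: trigger evaluation and return a result to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, func):
        return fold(func, self._compute())


numbers = ToyRDD([1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # lazy
print(evens_squared.collect())                   # [4, 16, 36]
print(evens_squared.count())                     # 3
print(evens_squared.reduce(lambda a, b: a + b))  # 56
```

In Spark the recorded steps form the DAG that the scheduler splits into tasks per partition; the toy version runs them in a single process.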

Spark – RDD – Resilient
RDDs store their lineage
If a node fails, the affected partition of the RDD can therefore be rebuilt at any time
[Figure: lineage chain]
HdfsRDD (path: hdfs://…)
FilteredRDD (func: contains(...))
MappedRDD (func: split(…))
CachedRDD
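Rebuilding from lineage can be sketched with a toy chain that mirrors the figure: source, filter, map, cached result. The class `LineageRDD`, the sample log lines and the "ERROR" filter term are assumptions for illustration; the slide elides the actual path and arguments, and this is not Spark code.

```python
class LineageRDD:
    """Toy RDD partition that remembers its parent and the function that derived it."""

    def __init__(self, parent=None, func=None, source=None):
        self.parent = parent
        self.func = func
        self.source = source  # only set on the root; stands in for a file on HDFS
        self.cache = None     # materialized data; may be lost on node failure
        self.recomputations = 0

    def derive(self, func):
        return LineageRDD(parent=self, func=func)

    def compute(self):
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            data = list(self.source)
        else:
            # Lineage: re-apply the recorded function to the parent's data.
            data = self.func(self.parent.compute())
        self.recomputations += 1
        self.cache = data
        return data

    def lose_cache(self):
        """Simulate losing the in-memory partition."""
        self.cache = None


lines = LineageRDD(source=["ERROR disk full", "INFO started", "ERROR timeout"])
errors = lines.derive(lambda rows: [r for r in rows if "ERROR" in r])
fields = errors.derive(lambda rows: [r.split(" ") for r in rows])

first = fields.compute()
fields.lose_cache()           # simulated failure
rebuilt = fields.compute()    # rebuilt from lineage, no replica needed
print(rebuilt == first)       # True
print(fields.recomputations)  # 2
```

Only the lost partition is recomputed from its parents, so no replicated copy of the derived data has to be kept.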

Spark Performance Compared
Logistic regression
[Figure: runtime comparison. Baseline: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations.]
Source: University of California, Berkeley

Spark Stack
Spark
Spark Streaming: real-time processing of data in "micro" batches
Spark SQL: HiveQL compatible
MLlib: machine learning (classification, clustering, regression, collaborative filtering)
GraphX: specialized RDDs and operations (PageRank, SVD++)

Integration – Spark & Cassandra
With the Spark Cassandra Connector
Cassandra tables are available as RDDs
A Spark executor runs on every Cassandra node
[Figure: ring of five nodes (1 to 5)]

Advantages of Integrating Spark and Cassandra
Data locality, token-aware
Spark RDDs on in-memory C* tables
SQL on Cassandra (joins!)
Database-side filters in Spark
Spark Streaming is supported
Both directions: read and write
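Why database-side filters matter can be shown with a small comparison of how many rows leave the database. This is a plain-Python sketch, not connector code: the table contents and the helpers `scan` and `is_s1` are hypothetical.

```python
TABLE = [
    {"sensor": "s1", "day": "2015-03-01", "value": 20.1},
    {"sensor": "s1", "day": "2015-03-02", "value": 21.4},
    {"sensor": "s2", "day": "2015-03-01", "value": 18.0},
    {"sensor": "s2", "day": "2015-03-02", "value": 17.2},
]


def scan(table, predicate=None):
    """Toy database scan: returns the rows sent to the compute layer."""
    if predicate is None:
        return list(table)
    return [row for row in table if predicate(row)]


def is_s1(row):
    return row["sensor"] == "s1"


# Filter in the compute layer: every row is transferred first.
transferred_all = scan(TABLE)
result_late = [row for row in transferred_all if is_s1(row)]

# Filter in the database: only matching rows are transferred.
transferred_filtered = scan(TABLE, predicate=is_s1)

print(len(transferred_all), len(transferred_filtered))  # 4 2
print(result_late == transferred_filtered)              # True
```

Both variants yield the same result; the database-side filter simply moves less data, which adds to the benefit of running executors next to the data.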

Spark Streaming
Can be integrated with the Cassandra/Spark driver
Micro-batches (1 s), discretized streams
Exactly-once semantics
RDD functionality
Integration with various message queues (e.g. Kafka)
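The micro-batch idea can be sketched without Spark: incoming events are cut into fixed time slices, and each slice is processed like a small static dataset. The function name `micro_batches` and the sample events are assumptions; the 1-second default follows the batch length named on the slide.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds=1.0):
    """Group (timestamp, payload) events into fixed-length batches, ordered by time."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // batch_seconds)].append(payload)
    # Intervals without events produce no batch in this toy version.
    return [batches[index] for index in sorted(batches)]


events = [(0.10, "a"), (0.75, "b"), (1.20, "c"), (2.05, "d"), (2.90, "e")]
for batch in micro_batches(events):
    # Each batch is handled like a small dataset (an RDD in Spark Streaming).
    print(len(batch), batch)
# 2 ['a', 'b']
# 1 ['c']
# 2 ['d', 'e']
```

Because every batch is an ordinary dataset, the same transformations and actions used for batch jobs apply to the stream.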

Further Developments in the Spark Ecosystem
SparkR, PySpark
Spark integration in R
lapply implementation
Can be used in closures in R
Interactive R with Spark is possible
on data in Cassandra

WidasConcepts
HighEnd-Technology requires HighEnd-Competence
We are happy to advise you
WidasConcepts GmbH
Maybachstraße 2
71299 Wimsheim
www.widas.de
This presentation was created by WidasConcepts. Distribution, citation, and copying, in whole or in part, for the purpose of passing it on to third parties are permitted only with the prior written consent of WidasConcepts.
These abstracts and graphics were used by WidasConcepts within the scope of a presentation. It is not a complete documentation of this event.
Thomas Mann, Solution Architect
Phone: +49 (7044) 95103 – 100
Mobile: +49 162 259 56 90
Mail: thomas.mann@widas.de
