SlideShare a Scribd company logo
1 of 27
Download to read offline
Rocco Varela
Software Engineer in Test, DataStax Analytics Team
Spark Streaming Fault Tolerance with DataStax Enterprise 5.0
© DataStax, All Rights Reserved.
1 Spark Streaming fault-tolerance with the DataStax
2 Testing for the real world
3 Best practices and lessons learned
4 Questions and Answers
2
© DataStax, All Rights Reserved.
DSE supports Spark Streaming fault-tolerance
DataStax is a great place to run Spark Streaming, but applications can die and may need special care.
The Spark-Cassandra stack on DataStax Enterprise (DSE) provides all the fault-tolerant requirements
needed for Spark Streaming.
3
Driver
Executor
Launch
jobs using
offset
ranges
Spark
Streaming
C*
Read data using offset
ranges in jobs using Direct
API
Query latest offsets and
decide offset ranges for
batch
Kafka
Spark Streaming using
Kafka Direct API
© DataStax, All Rights Reserved.
We should be prepared for driver failures
If the driver node running the Spark Streaming application fails, then the Spark Context is lost, and all
executors with their in-memory data are lost as well.
How can we protect ourselves from this scenario?
4
Driver
Executor
Launch
jobs using
offset
ranges
Spark
Streaming
C*
Read data using offset
ranges in jobs using Direct
API
Query latest offsets and
decide offset ranges for
batch
Kafka
Spark Streaming using
Kafka Direct API
X
© DataStax, All Rights Reserved.
Checkpoints are used for recovery automatically
Checkpointing is basically saving a snapshot of an application’s state, which can be used for recovery in case of failure.
DataStax Enterprise Filesystem (DSEFS) is our reliable storage mechanism, new in DSE 5.0.
We can run Spark Streaming with fault-tolerance without HDFS, or any other filesystem.
5
Driver
Executor
Launch
jobs using
offset
ranges
Spark
Streaming
C*
Read data using offset
ranges in jobs using Direct
API
Query recovered offsets
and decide offset ranges for
batch
Kafka
Spark Streaming using
Kafka Direct API
DSEFS
Updates
metadata
Retrieves
checkpoint
1
Write
checkpoint
5
1
4
2
3
5
✓
© DataStax, All Rights Reserved.
DataStax Enterprise Filesystem is our reliable storage
DataStax Enterprise Filesystem (DSEFS) is our
reliable storage mechanism, new in DSE 5.0.
At the right shows a snapshot of the DSEFS shell
with some utilities we can use to examine our data.
Improvements over Cassandra Filesystem (CFS)
include:
• Broader support for the HDFS API:
• append, hflush, getFileBlockLocations
• More fine-grained file replication
• Designed for optimal performance
6
$ dse fs
dsefs / > help
append Append a local file to a remote file
cat Concatenate files and print on the
standard output
cd Change remote working directory
df List file system status and disk space
usage
exit Exit shell client
fsck Check filesystem and repair errors
get Copy a DSE file to the local filesystem
help Display this list
ls List directory contents
mkdir Make directories
mv Move a file or directory
put Copy a local file to the DSE filesystem
rename Rename a file or directory without moving
it to a different directory
rm Remove files or directories
rmdir Remove empty directories
stat Display file status
truncate Truncate files
umount Unmount file system storage locations
DSEFS Shell
© DataStax, All Rights Reserved.
We can benchmark DSEFS with cfs-stress
7
$ cfs-stress [options] filesystem_directory
$ cfs-stress -s 64000 -n 1000 -t 8 -r 2 -o WR --shared-cfs dsefs:///stressdir
Warming up...
Writing and reading 65536 MB to/from dsefs:/cfs-stress-test in 1000 files.
progress bytes curr rate avg rate max latency
0.7% 895.9 MB 0.0 MB/s 983.4 MB/s 35.354 ms
0.8% 1048.6 MB 984.2 MB/s 546.4 MB/s 23.390 ms
...
99.4% 130252.9 MB 652.3 MB/s 764.3 MB/s 162.944 ms
99.8% 130761.1 MB 668.5 MB/s 762.8 MB/s 23.540 ms
100.0% 131072.0 MB 491.2 MB/s 762.7 MB/s 5.776 ms
http://docs.datastax.com/en/latest-dse/datastax_enterprise/tools/CFSstress.html?hl=cfs,stress
Example output:
Data Output Description
progress Total progress of the stress operation
bytes Total bytes written/read
curr rate Current rate of bytes being written/read per second
avg rate Average rate of bytes being written/read per second
max latency Maximum latency in milliseconds during the current reporting window
$ cfs-stress [options] filesystem_directory
Usage:
Short
form
Long form Description
-s --size Size of each file in KB. Default 1024.
-n --count
Total number of files read/written. Default
100.
-t --threads Number of threads.
-r --streams
Maximum number of streams kept open
per thread. Default 2.
-o --operation
(R)ead, (W)rite, (WR) write & read,
(WRD) write, read, delete
Options (truncated):
© DataStax, All Rights Reserved.
We can deploy multiple DSE Filesystems across
multiple data centers
Each datacenter should maintain a separate instance of DSEFS.
8
https://docs.datastax.com/en/latest-dse/datastax_enterprise/ana/configUsingDsefs.html#configUsingDsefs__configMultiDC
dsefs_options:
...
keyspace_name: dsefs1
ALTER KEYSPACE dsefs1 WITH replication = {
'class': 'NetworkTopologyStrategy',
'DC1': '3'
};
dsefs_options:
...
keyspace_name: dsefs2
ALTER KEYSPACE dsefs2 WITH replication = {
'class': 'NetworkTopologyStrategy',
'DC2': '3'
};
1. In the dse.yaml files, specify a separate
DSEFS keyspace for each data center.
On each DC1 node:
On each DC2 node:
2. Set the keyspace replication appropriately for
each datacenter.
On each DC1 node:
On each DC2 node:
© DataStax, All Rights Reserved.
DSE Filesystem supports JBOD
• Data directories can be setup with JBOD
(Just a bunch of disks)
• storage_weight controls how much data is
stored in each directory relative to other
directories
• Directories should be mounted on different
physical devices
• If a server fails, we can move the storage to
another node
• Designed for efficiency:
• Does not use JVM heap for FS data
• Uses async I/O and Cassandra queries
• Uses thread-per-core threading model
9
DSE File System Options (dse.yaml)
dsefs_options:

enabled: true



...


data_directories:

- dir: /mnt/cass_data_disks/dsefs_data1

# How much data should be placed in this directory relative to
# other directories in the cluster

storage_weight: 1.0

...
- dir: /mnt/cass_data_disks/dsefs_data2

# How much data should be placed in this directory relative to
# other directories in the cluster

storage_weight: 5.0

...
© DataStax, All Rights Reserved.
We use an email messaging app to test fault-tolerance
10
CREATE TABLE email_msg_tracker (
msg_id text,
tenant_id uuid,
mailbox_id uuid,
time_delivered timestamp,
time_forwarded timestamp,
time_read timestamp,
time_replied timestamp,
PRIMARY KEY ((msg_id, tenant_id), mailbox_id)
)
Data Model
https://github.com/datastax/code-samples/tree/master/
Application Setup
data
© DataStax, All Rights Reserved.
Enabling fault-tolerance in Spark is simple
1. On startup for the first time, it will create
a new StreamingContext.
2. Within our context creation method we
configure checkpointing with
ssc.checkpoint(path).
3. We will also define our DStream in this
function.
4. When the program is restarted after
failure, it will re-create a StreamingContext
from the checkpoint.
11
Spark Streaming Application Setup
Define DStream
Set Checkpoint Path
Create
createStreamingContext()
Checkpoint
Recover Checkpoint
Start
Start
StreamingContext
NO YES
© DataStax, All Rights Reserved.
Enabling the DSE Filesystem is even easier
Configurations:
1. Enable DSEFS
2. Keyspace name for storing metadata
• metadata is stored in Cassandra
• no SPOF, always-on, scalable, durable, etc.
3. Working directory
• data is tiny, no special throughput,
latency or capacity requirements
4. Data directory
• JBOD support
• Designed for efficiency
12
4
3
2
1
DSE File System Options (dse.yaml)
dsefs_options:

enabled: true



# Keyspace for storing the DSE FS metadata

keyspace_name: dsefs



# The local directory for storing node-local metadata

# The work directory must not be shared by DSE FS nodes.

work_dir: /mnt/data_disks/data1/dse/dsefs

. . .


data_directories:

- dir: /var/lib/dsefs/data

# How much data should be placed in this directory relative to
# other directories in the cluster

storage_weight: 1.0

# Reserved space (in bytes) that is not going to be used for
# storing blocks

min_free_space: 5368709120
© DataStax, All Rights Reserved.
Checkpoint files are managed automatically
These files are automatically managed by our application and DataStax Filesystem (DSEFS).
Under the hood, these are some files we expect to see:
• checkpoint-1471492965000: serialized checkpoint file containing data used for recovery
• a2f2452d-3608-43b9-ba60-92725b54399a: directory used for storing Write-Ahead-Log data
• receivedBlockMetadata: used for Write-Ahead-Logging to store received block metadata
13
# Start streaming application with DSEFS URI and checkpoint dir
$ dse spark-submit 
—class sparkAtScale.StreamingDirectEmails 
path/to/myjar/streaming-assembly-0.1.jar 
dsefs:///emails_checkpoint
$ dse fs
dsefs / > ls -l emails_checkpoint
Type ... Modified Name
dir ... 2016-08-17 21:02:40-0700 a2f2452d-3608-43b9-ba60-92725b54399a
dir ... 2016-08-17 21:02:50-0700 receivedBlockMetadata
file ... 2016-08-17 21:02:49-0700 checkpoint-1471492965000
© DataStax, All Rights Reserved.
DSE Filesystem stands up real world challenges
DSE Filesystem demonstrated stable perfomance.
In this experiment we tested a 3-node cluster streaming data from Kafka.
The summary on the right shows stable input rate and processing time.
14
Metadata checkpointing (Kafka Direct API)
• DSEFS Rep. Factor: 3
• Kafka/Spark partitions: 306
• Spark cores/node: 34
• Spark Executor Mem: 2G
Streaming
• Rate: ~67K events/sec
• Batch Interval: 10 sec
• Total Delay: 5.5 sec
Driver
Executo
Spark Streaming
C*
DSEF
Kafka
ExecutoC*
DSEF Spark
ExecutoC*
DSEF Spark
Cluster Setup
Streaming Statistics
© DataStax, All Rights Reserved.
Tuning Spark Streaming in a nutshell
Tuning Spark Streaming boils down to balancing the input rate and
processing throughput.
At a high level, we need to consider two things:
1. Reducing the processing time of each batch
2. Setting the right batch size so batches can be processed as
fast as they are received
What are some ways we can accomplish these goals?
15
© DataStax, All Rights Reserved.
Reduce processing time by increasing parallelism
16
Exaggerated example obviously, main point:
Proper parallelism can achieve better performance and
stability.
Processing Time: The time to process each batch of data.
Both experiments had the same configs:
• input rate: ~85K events/sec
• batch interval: 5 sec
• maxRatePerPartition: 100K
Experiment #2 has a better:
• workload distribution
• lower processing time and delays
• processing time remains below the batch interval
• overall better stability
If we see Exp. #1 behavior even with parallelism set
properly, we need to exam CPU utilization more closely.
Experiment #1: 1 Spark partition
Experiment #2: 100 Spark partitions
Spark Streaming Statistics
© DataStax, All Rights Reserved.
Reduce processing time by increasing parallelism
17
Spark Streaming Statistics
Exaggerated example obviously, main point:
Proper parallelism can achieve better performance and
stability.
Scheduling Delay: the time a batch waits in a queue for
the processing of previous batches to finish.
Both experiments had the same configs:
• input rate: ~85K events/sec
• batch interval: 5 sec
• maxRatePerPartition: 100K
Experiment #2 has a better:
• workload distribution
• lower processing time and delays
• processing time remains below the batch interval
• overall better stability
If we see Exp. #1 behavior even with parallelism set
properly, we need to exam CPU utilization more closely.
Experiment #1: 1 Spark partition
Experiment #2: 100 Spark partitions
© DataStax, All Rights Reserved.
Reduce processing time by increasing parallelism
18
Spark Streaming Statistics
Exaggerated example obviously, main point:
Proper parallelism can achieve better performance and
stability.
Total Delay: Scheduling Delay + Processing Time
Both experiments had the same configs:
• input rate: ~85K events/sec
• batch interval: 5 sec
• maxRatePerPartition: 100K
Experiment #2 has a better:
• workload distribution
• lower processing time and delays
• processing time remains below the batch interval
• overall better stability
If we see Exp. #1 behavior even with parallelism set
properly, we need to exam CPU utilization more closely.
Experiment #1: 1 Spark partition
Experiment #2: 100 Spark partitions
© DataStax, All Rights Reserved.
Setting the right batch interval depends on the setup
19
Experiment #2 is a good example of stability:
• Batch Interval: 5 sec
• Avg. Processing Time: 3.3 sec
• Avg. Scheduling Delay: 135 ms
• Avg. Total Delay: 3.4 sec
• Minimal scheduling delay
• Consistent processing time below batch interval
A good approach is to test with a conservative batch
interval (say, 5-10s) and a low data input rate.
If the processing time < batch interval, then the
system is stable.
If the processing time > batch interval OR
scheduling delay increases, then reduce the
processing time.
Spark Streaming Statistics
Experiment #2: 100 Spark partitions
© DataStax, All Rights Reserved.
Consider this scenario: The streaming app is ingesting data and checkpointing without
issues, but when attempting to recover from a saved checkpoint the following error occurs.
What does this mean? Let’s look at our setup.
20
ERROR … org.apache.spark.streaming.StreamingContext: Error starting the context, marking it as stopped
org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@45984654 has
not been initialized
at scala.Option.orElse(Option.scala:257) ~[scala-library-2.10.5.jar:na]
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
…
Exception in thread "main" org.apache.spark.SparkException:
org.apache.spark.streaming.kafka.DirectKafkaInputDStream@45984654 has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:321)
…
How to implement checkpointing correctly
© DataStax, All Rights Reserved. 21
Define DStream
Set Checkpoint Path
Create
createStreamingContext()
Checkpoint
Recover Checkpoint
Start
Start
How to implement checkpointing correctly
Define DStream
Set Checkpoint Path
Create
createStreamingContext()
Checkpoint
Recover Checkpoint
Start
Start
Setup BSetup A
Error starting the context… o.a.s.SparkException: o.a.s.s.kafka.DirectKafkaInputDStream@45984654 has not been initialized
Hint: In red shows where the input stream is created, notice the different location between examples.
NO YES
NO YES
© DataStax, All Rights Reserved. 22
Define DStream
Set Checkpoint Path
Create
createStreamingContext()
Checkpoint
Recover Checkpoint
Start
Start
How to implement checkpointing correctly
Define DStream
Set Checkpoint Path
Create
createStreamingContext()
Checkpoint
Recover Checkpoint
Start
Start
Setup BSetup A
Error starting the context… o.a.s.SparkException: o.a.s.s.kafka.DirectKafkaInputDStream@45984654 has not been initialized
Hint: In red shows where the input stream is created, notice the different location between examples.
NO YES
NO YES
We want the stream to be recoverable, so we
need to move this into createStreamingContext().
The problem here is that we are creating the
stream in the wrong place.
© DataStax, All Rights Reserved. 23
Define DStream
Set Checkpoint Path
Create
createStreamingContext()
Checkpoint
Recover Checkpoint
Start
Start
How to implement checkpointing correctly
Define DStream
Set Checkpoint Path
Create
createStreamingContext()
Checkpoint
Recover Checkpoint
Start
Start
Setup BSetup A
Error starting the context… o.a.s.SparkException: o.a.s.s.kafka.DirectKafkaInputDStream@45984654 has not been initialized
Hint: In red shows where the input stream is created, notice the different location between examples.
NO YES
NO YES
We want just about
everything in this function.
© DataStax, All Rights Reserved. 24
How to implement checkpointing correctly
Code Sample
B
def main(args: Array[String]) {



val conf = new SparkConf() // with addition settings as needed

val sc = SparkContext.getOrCreate(conf)
val sqlContext = SQLContext.getOrCreate(sc)

import sqlContext.implicits._
def createStreamingContext(): StreamingContext = {

val newSsc = new StreamingContext(sc, Seconds(batchInterval))

newSsc.checkpoint(checkpoint_path)

newSsc

}



val ssc = StreamingContext.getActiveOrCreate(checkpoint_path,
createStreamingContext)

. . .

val emailsStream = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topics)



. . .


ssc.start()

ssc.awaitTermination()

}
Code Sample
A
Error starting the context… o.a.s.SparkException: o.a.s.s.kafka.DirectKafkaInputDStream@45984654 has not been initialized
Hint: In red shows where the input stream is created, notice the different location between examples.
: Highlights the method for creating the streaming context.
https://github.com/datastax/code-samples/tree/master/
def main(args: Array[String]) {



val conf = new SparkConf() // with addition settings as needed

val sc = SparkContext.getOrCreate(conf)
val sqlContext = SQLContext.getOrCreate(sc)

import sqlContext.implicits._
def createStreamingContext(): StreamingContext = {

val newSsc = new StreamingContext(sc, Seconds(batchInterval))

newSsc.checkpoint(checkpoint_path)
. . .

val emailsStream = KafkaUtils.createDirectStream[String,
String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)


. . .
newSsc

}



val ssc = StreamingContext.getActiveOrCreate(checkpoint_path,
createStreamingContext)



ssc.start()

ssc.awaitTermination()

}
Code Sample
B
© DataStax, All Rights Reserved.
General tips to keep things running smoothly
• One DSE node can run at most one DSEFS server.
• In production, always set RF >= 3 for the DSEFS keyspace.
• For optimal performance, organize local DSEFS data on different physical drives than
the Cassandra database.
• Currently DSEFS is intended for Spark streaming use cases, support for broader use
cases is underway.
25
http://docs.datastax.com/en/latest-dse/datastax_enterprise/ana/aboutDsefs.html?hl=dsefs
© DataStax, All Rights Reserved.
DSE Filesystem supports all Spark fault-tolerant goals
1.DataStax Enterprise Filesystem (DSEFS) can support all our needs for Spark Streaming fault-tolerance
(Metadata checkpointing, Write-ahead-logging, RDD checkpointing).
2.Setting up the DSEFS is very straight forward, and provides common utilities for managing data.
3.We covered some details to consider when build a Spark Streaming App.
4.We can use existing tools such as cfs-stress to benchmark I/O before deployment.
5.Many more features are currently underway for upcoming releases of DSEFS:
• Enhanced security
• Hardening for long-term storage: fsck improvements, checksumming
• Performance improvements: Compression, data locality improvements and more
• Access to other types of filesystems: HDFS, local file system from the DSEFS shell
• User-experience improvements in the DSEFS shell: tab-completion, wildcards, etc.
26
Thank You
• Piotr Kołaczkowski (DSEFS Architect and Lead Developer)
• Brian Hess (Product Management)
• Jarosław Grabowski (DSEFS Developer)
• Russell Spitzer (DSE Analytics Developer)
• Ryan Knight (Solutions Engineer)
• Giovanny Castro (DSE Test Engineer)
• DSE Analytics Team
https://github.com/datastax/code-samples/tree/master/

More Related Content

What's hot

DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C...
DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C...DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C...
DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C...DataStax
 
Cassandra Tuning - above and beyond
Cassandra Tuning - above and beyondCassandra Tuning - above and beyond
Cassandra Tuning - above and beyondMatija Gobec
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraJim Hatcher
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016DataStax
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...DataStax
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
 
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...DataStax
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016DataStax
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Building Highly Available Apps on Cassandra (Robbie Strickland, Weather Compa...
Building Highly Available Apps on Cassandra (Robbie Strickland, Weather Compa...Building Highly Available Apps on Cassandra (Robbie Strickland, Weather Compa...
Building Highly Available Apps on Cassandra (Robbie Strickland, Weather Compa...DataStax
 
Instaclustr webinar 2017 feb 08 japan
Instaclustr webinar 2017 feb 08   japanInstaclustr webinar 2017 feb 08   japan
Instaclustr webinar 2017 feb 08 japanHiromitsu Komatsu
 
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...DataStax
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...DataStax
 
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...DataStax
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in CassandraShogo Hoshii
 
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...DataStax
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)DataStax Academy
 
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchMigration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchDataStax Academy
 

What's hot (20)

DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C...
DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C...DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C...
DataStax | Best Practices for Securing DataStax Enterprise (Matt Kennedy) | C...
 
Cassandra Tuning - above and beyond
Cassandra Tuning - above and beyondCassandra Tuning - above and beyond
Cassandra Tuning - above and beyond
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Building Highly Available Apps on Cassandra (Robbie Strickland, Weather Compa...
Building Highly Available Apps on Cassandra (Robbie Strickland, Weather Compa...Building Highly Available Apps on Cassandra (Robbie Strickland, Weather Compa...
Building Highly Available Apps on Cassandra (Robbie Strickland, Weather Compa...
 
Instaclustr webinar 2017 feb 08 japan
Instaclustr webinar 2017 feb 08   japanInstaclustr webinar 2017 feb 08   japan
Instaclustr webinar 2017 feb 08 japan
 
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
 
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
Running 400-node Cassandra + Spark Clusters in Azure (Anubhav Kale, Microsoft...
 
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
Oracle: Let My People Go! (Shu Zhang, Ilya Sokolov, Symantec) | Cassandra Sum...
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in Cassandra
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
DataStax | Advanced DSE Analytics Client Configuration (Jacek Lewandowski) | ...
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Migration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a HitchMigration Best Practices: From RDBMS to Cassandra without a Hitch
Migration Best Practices: From RDBMS to Cassandra without a Hitch
 

Viewers also liked

Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...
Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...
Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...DataStax
 
Graph Data -- RDF and Property Graphs
Graph Data -- RDF and Property GraphsGraph Data -- RDF and Property Graphs
Graph Data -- RDF and Property Graphsandyseaborne
 
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...DataStax
 
Big Data Solutions Executive Overview
Big Data Solutions Executive OverviewBig Data Solutions Executive Overview
Big Data Solutions Executive OverviewRCG Global Services
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax
 
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...DataStax
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesAerospike, Inc.
 
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph DatabasesDataStax Academy
 
Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)DataStax Academy
 
Successful Software Development with Apache Cassandra
Successful Software Development with Apache CassandraSuccessful Software Development with Apache Cassandra
Successful Software Development with Apache CassandraDataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and CassandraDataStax Academy
 

Viewers also liked (20)

Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...
Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...
Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins...
 
Graph Data -- RDF and Property Graphs
Graph Data -- RDF and Property GraphsGraph Data -- RDF and Property Graphs
Graph Data -- RDF and Property Graphs
 
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
 
Hadoop to spark-v2
Hadoop to spark-v2Hadoop to spark-v2
Hadoop to spark-v2
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
 
Big Data Solutions Executive Overview
Big Data Solutions Executive OverviewBig Data Solutions Executive Overview
Big Data Solutions Executive Overview
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
 
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use Cases
 
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph Databases
 
Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Successful Software Development with Apache Cassandra
Successful Software Development with Apache CassandraSuccessful Software Development with Apache Cassandra
Successful Software Development with Apache Cassandra
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 

Similar to DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela) | Cassandra Summit 2016

Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Containerized Data Persistence on Mesos
Containerized Data Persistence on MesosContainerized Data Persistence on Mesos
Containerized Data Persistence on MesosJoe Stein
 
Oaklands college: Protecting your data.
Oaklands college: Protecting your data.Oaklands college: Protecting your data.
Oaklands college: Protecting your data.JISC RSC Eastern
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualizationTesting Delphix: easy data virtualization
Testing Delphix: easy data virtualizationFranck Pachot
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network ProcessingRyousei Takano
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiUnmesh Baile
 
Understanding the Windows Server Administration Fundamentals (Part-2)
Understanding the Windows Server Administration Fundamentals (Part-2)Understanding the Windows Server Administration Fundamentals (Part-2)
Understanding the Windows Server Administration Fundamentals (Part-2)Tuan Yang
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the ElephantDataWorks Summit
 
Continuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed GapsContinuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed GapsGilHecht
 

Similar to DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela) | Cassandra Summit 2016 (20)

Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Containerized Data Persistence on Mesos
Containerized Data Persistence on MesosContainerized Data Persistence on Mesos
Containerized Data Persistence on Mesos
 
Daos
DaosDaos
Daos
 
Oaklands college: Protecting your data.
Oaklands college: Protecting your data.Oaklands college: Protecting your data.
Oaklands college: Protecting your data.
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualizationTesting Delphix: easy data virtualization
Testing Delphix: easy data virtualization
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Migration from 8.1 to 11.3
Migration from 8.1 to 11.3Migration from 8.1 to 11.3
Migration from 8.1 to 11.3
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbai
 
Understanding the Windows Server Administration Fundamentals (Part-2)
Understanding the Windows Server Administration Fundamentals (Part-2)Understanding the Windows Server Administration Fundamentals (Part-2)
Understanding the Windows Server Administration Fundamentals (Part-2)
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the Elephant
 
Continuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed GapsContinuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed Gaps
 

More from DataStax

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?DataStax
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...DataStax
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsDataStax
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphDataStax
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyDataStax
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...DataStax
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache KafkaDataStax
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseDataStax
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0DataStax
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...DataStax
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesDataStax
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDataStax
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudDataStax
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceDataStax
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...DataStax
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...DataStax
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...DataStax
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)DataStax
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsDataStax
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingDataStax
 

More from DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Recently uploaded

Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 

Recently uploaded (20)

Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 

DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela) | Cassandra Summit 2016

  • 1. Rocco Varela Software Engineer in Test, DataStax Analytics Team Spark Streaming Fault Tolerance with DataStax Enterprise 5.0
  • 2. © DataStax, All Rights Reserved. 1 Spark Streaming fault-tolerance with the DataStax 2 Testing for the real world 3 Best practices and lessons learned 4 Questions and Answers 2
  • 3. © DataStax, All Rights Reserved. DSE supports Spark Streaming fault-tolerance DataStax is a great place to run Spark Streaming, but applications can die and may need special care. The Spark-Cassandra stack on DataStax Enterprise (DSE) provides all the fault-tolerant requirements needed for Spark Streaming. 3 Driver Executor Launch jobs using offset ranges Spark Streaming C* Read data using offset ranges in jobs using Direct API Query latest offsets and decide offset ranges for batch Kafka Spark Streaming using Kafka Direct API
  • 4. © DataStax, All Rights Reserved. We should be prepared for driver failures If the driver node running the Spark Streaming application fails, then the Spark Context is lost, and all executors with their in-memory data are lost as well. How can we protect ourselves from this scenario? 4 Driver Executor Launch jobs using offset ranges Spark Streaming C* Read data using offset ranges in jobs using Direct API Query latest offsets and decide offset ranges for batch Kafka Spark Streaming using Kafka Direct API X
  • 5. © DataStax, All Rights Reserved. Checkpoints are used for recovery automatically Checkpointing is basically saving a snapshot of an application’s state, which can be used for recovery in case of failure. DataStax Enterprise Filesystem (DSEFS) is our reliable storage mechanism, new in DSE 5.0. We can run Spark Streaming with fault-tolerance without HDFS, or any other filesystem. 5 Driver Executor Launch jobs using offset ranges Spark Streaming C* Read data using offset ranges in jobs using Direct API Query recovered offsets and decide offset ranges for batch Kafka Spark Streaming using Kafka Direct API DSEFS Updates metadata Retrieves checkpoint 1 Write checkpoint 5 1 4 2 3 5 ✓
  • 6. © DataStax, All Rights Reserved. DataStax Enterprise Filesystem is our reliable storage DataStax Enterprise Filesystem (DSEFS) is our reliable storage mechanism, new in DSE 5.0. At the right shows a snapshot of the DSEFS shell with some utilities we can use to examine our data. Improvements over Cassandra Filesystem (CFS) include: • Broader support for the HDFS API: • append, hflush, getFileBlockLocations • More fine-grained file replication • Designed for optimal performance 6 $ dse fs dsefs / > help append Append a local file to a remote file cat Concatenate files and print on the standard output cd Change remote working directory df List file system status and disk space usage exit Exit shell client fsck Check filesystem and repair errors get Copy a DSE file to the local filesystem help Display this list ls List directory contents mkdir Make directories mv Move a file or directory put Copy a local file to the DSE filesystem rename Rename a file or directory without moving it to a different directory rm Remove files or directories rmdir Remove empty directories stat Display file status truncate Truncate files umount Unmount file system storage locations DSEFS Shell
  • 7. © DataStax, All Rights Reserved. We can benchmark DSEFS with cfs-stress 7 $ cfs-stress [options] filesystem_directory $ cfs-stress -s 64000 -n 1000 -t 8 -r 2 -o WR --shared-cfs dsefs:///stressdir Warming up... Writing and reading 65536 MB to/from dsefs:/cfs-stress-test in 1000 files. progress bytes curr rate avg rate max latency 0.7% 895.9 MB 0.0 MB/s 983.4 MB/s 35.354 ms 0.8% 1048.6 MB 984.2 MB/s 546.4 MB/s 23.390 ms ... 99.4% 130252.9 MB 652.3 MB/s 764.3 MB/s 162.944 ms 99.8% 130761.1 MB 668.5 MB/s 762.8 MB/s 23.540 ms 100.0% 131072.0 MB 491.2 MB/s 762.7 MB/s 5.776 ms http://docs.datastax.com/en/latest-dse/datastax_enterprise/tools/CFSstress.html?hl=cfs,stress Example output: Data Output Description progress Total progress of the stress operation bytes Total bytes written/read curr rate Current rate of bytes being written/read per second avg rate Average rate of bytes being written/read per second max latency Maximum latency in milliseconds during the current reporting window $ cfs-stress [options] filesystem_directory Usage: Short form Long form Description -s --size Size of each file in KB. Default 1024. -n --count Total number of files read/written. Default 100. -t --threads Number of threads. -r --streams Maximum number of streams kept open per thread. Default 2. -o --operation (R)ead, (W)rite, (WR) write & read, (WRD) write, read, delete Options (truncated):
  • 8. © DataStax, All Rights Reserved. We can deploy multiple DSE Filesystems across multiple data centers Each datacenter should maintain a separate instance of DSEFS. 8 https://docs.datastax.com/en/latest-dse/datastax_enterprise/ana/configUsingDsefs.html#configUsingDsefs__configMultiDC dsefs_options: ... keyspace_name: dsefs1 ALTER KEYSPACE dsefs1 WITH replication = { 'class': 'NetworkTopologyStrategy', 'DC1': '3' }; dsefs_options: ... keyspace_name: dsefs2 ALTER KEYSPACE dsefs2 WITH replication = { 'class': 'NetworkTopologyStrategy', 'DC2': '3' }; 1. In the dse.yaml files, specify a separate DSEFS keyspace for each data center. On each DC1 node: On each DC2 node: 2. Set the keyspace replication appropriately for each datacenter. On each DC1 node: On each DC2 node:
  • 9. © DataStax, All Rights Reserved. DSE Filesystem supports JBOD • Data directories can be setup with JBOD (Just a bunch of disks) • storage_weight controls how much data is stored in each directory relative to other directories • Directories should be mounted on different physical devices • If a server fails, we can move the storage to another node • Designed for efficiency: • Does not use JVM heap for FS data • Uses async I/O and Cassandra queries • Uses thread-per-core threading model 9 DSE File System Options (dse.yaml) dsefs_options:
 enabled: true
 
 ... 
 data_directories:
 - dir: /mnt/cass_data_disks/dsefs_data1
 # How much data should be placed in this directory relative to # other directories in the cluster
 storage_weight: 1.0
 ... - dir: /mnt/cass_data_disks/dsefs_data2
 # How much data should be placed in this directory relative to # other directories in the cluster
 storage_weight: 5.0
 ...
  • 10. © DataStax, All Rights Reserved. We use an email messaging app to test fault-tolerance 10 CREATE TABLE email_msg_tracker ( msg_id text, tenant_id uuid, mailbox_id uuid, time_delivered timestamp, time_forwarded timestamp, time_read timestamp, time_replied timestamp, PRIMARY KEY ((msg_id, tenant_id), mailbox_id) ) Data Model https://github.com/datastax/code-samples/tree/master/ Application Setup data
  • 11. © DataStax, All Rights Reserved. Enabling fault-tolerance in Spark is simple 1. On startup for the first time, it will create a new StreamingContext. 2. Within our context creation method we configure checkpointing with ssc.checkpoint(path). 3. We will also define our DStream in this function. 4. When the program is restarted after failure, it will re-create a StreamingContext from the checkpoint. 11 Spark Streaming Application Setup Define DStream Set Checkpoint Path Create createStreamingContext() Checkpoint Recover Checkpoint Start Start StreamingContext NO YES
  • 12. © DataStax, All Rights Reserved. Enabling the DSE Filesystem is even easier Configurations: 1. Enable DSEFS 2. Keyspace name for storing metadata • metadata is stored in Cassandra • no SPOF, always-on, scalable, durable, etc. 3. Working directory • data is tiny, no special throughput, latency or capacity requirements 4. Data directory • JBOD support • Designed for efficiency 12 4 3 2 1 DSE File System Options (dse.yaml) dsefs_options:
 enabled: true
 
 # Keyspace for storing the DSE FS metadata
 keyspace_name: dsefs
 
 # The local directory for storing node-local metadata
 # The work directory must not be shared by DSE FS nodes.
 work_dir: /mnt/data_disks/data1/dse/dsefs
 . . . 
 data_directories:
 - dir: /var/lib/dsefs/data
 # How much data should be placed in this directory relative to # other directories in the cluster
 storage_weight: 1.0
 # Reserved space (in bytes) that is not going to be used for # storing blocks
 min_free_space: 5368709120
  • 13. © DataStax, All Rights Reserved. Checkpoint files are managed automatically These files are automatically managed by our application and DataStax Filesystem (DSEFS). Under the hood, these are some files we expect to see: • checkpoint-1471492965000: serialized checkpoint file containing data used for recovery • a2f2452d-3608-43b9-ba60-92725b54399a: directory used for storing Write-Ahead-Log data • receivedBlockMetadata: used for Write-Ahead-Logging to store received block metadata 13 # Start streaming application with DSEFS URI and checkpoint dir $ dse spark-submit —class sparkAtScale.StreamingDirectEmails path/to/myjar/streaming-assembly-0.1.jar dsefs:///emails_checkpoint $ dse fs dsefs / > ls -l emails_checkpoint Type ... Modified Name dir ... 2016-08-17 21:02:40-0700 a2f2452d-3608-43b9-ba60-92725b54399a dir ... 2016-08-17 21:02:50-0700 receivedBlockMetadata file ... 2016-08-17 21:02:49-0700 checkpoint-1471492965000
  • 14. © DataStax, All Rights Reserved. DSE Filesystem stands up real world challenges DSE Filesystem demonstrated stable perfomance. In this experiment we tested a 3-node cluster streaming data from Kafka. The summary on the right shows stable input rate and processing time. 14 Metadata checkpointing (Kafka Direct API) • DSEFS Rep. Factor: 3 • Kafka/Spark partitions: 306 • Spark cores/node: 34 • Spark Executor Mem: 2G Streaming • Rate: ~67K events/sec • Batch Interval: 10 sec • Total Delay: 5.5 sec Driver Executo Spark Streaming C* DSEF Kafka ExecutoC* DSEF Spark ExecutoC* DSEF Spark Cluster Setup Streaming Statistics
  • 15. © DataStax, All Rights Reserved. Tuning Spark Streaming in a nutshell Tuning Spark Streaming boils down to balancing the input rate and processing throughput. At a high level, we need to consider two things: 1. Reducing the processing time of each batch 2. Setting the right batch size so batches can be processed as fast as they are received What are some ways we can accomplish these goals? 15
  • 16. © DataStax, All Rights Reserved. Reduce processing time by increasing parallelism 16 Exaggerated example obviously, main point: Proper parallelism can achieve better performance and stability. Processing Time: The time to process each batch of data. Both experiments had the same configs: • input rate: ~85K events/sec • batch interval: 5 sec • maxRatePerPartition: 100K Experiment #2 has a better: • workload distribution • lower processing time and delays • processing time remains below the batch interval • overall better stability If we see Exp. #1 behavior even with parallelism set properly, we need to exam CPU utilization more closely. Experiment #1: 1 Spark partition Experiment #2: 100 Spark partitions Spark Streaming Statistics
  • 17. © DataStax, All Rights Reserved. Reduce processing time by increasing parallelism 17 Spark Streaming Statistics Exaggerated example obviously, main point: Proper parallelism can achieve better performance and stability. Scheduling Delay: the time a batch waits in a queue for the processing of previous batches to finish. Both experiments had the same configs: • input rate: ~85K events/sec • batch interval: 5 sec • maxRatePerPartition: 100K Experiment #2 has a better: • workload distribution • lower processing time and delays • processing time remains below the batch interval • overall better stability If we see Exp. #1 behavior even with parallelism set properly, we need to exam CPU utilization more closely. Experiment #1: 1 Spark partition Experiment #2: 100 Spark partitions
  • 18. © DataStax, All Rights Reserved. Reduce processing time by increasing parallelism 18 Spark Streaming Statistics Exaggerated example obviously, main point: Proper parallelism can achieve better performance and stability. Total Delay: Scheduling Delay + Processing Time Both experiments had the same configs: • input rate: ~85K events/sec • batch interval: 5 sec • maxRatePerPartition: 100K Experiment #2 has a better: • workload distribution • lower processing time and delays • processing time remains below the batch interval • overall better stability If we see Exp. #1 behavior even with parallelism set properly, we need to exam CPU utilization more closely. Experiment #1: 1 Spark partition Experiment #2: 100 Spark partitions
  • 19. © DataStax, All Rights Reserved. Setting the right batch interval depends on the setup 19 Experiment #2 is a good example of stability: • Batch Interval: 5 sec • Avg. Processing Time: 3.3 sec • Avg. Scheduling Delay: 135 ms • Avg. Total Delay: 3.4 sec • Minimal scheduling delay • Consistent processing time below batch interval A good approach is to test with a conservative batch interval (say, 5-10s) and a low data input rate. If the processing time < batch interval, then the system is stable. If the processing time > batch interval OR scheduling delay increases, then reduce the processing time. Spark Streaming Statistics Experiment #2: 100 Spark partitions
  • 20. © DataStax, All Rights Reserved. Consider this scenario: The streaming app is ingesting data and checkpointing without issues, but when attempting to recover from a saved checkpoint the following error occurs. What does this mean? Let’s look at our setup. 20 ERROR … org.apache.spark.streaming.StreamingContext: Error starting the context, marking it as stopped org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@45984654 has not been initialized at scala.Option.orElse(Option.scala:257) ~[scala-library-2.10.5.jar:na] at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339) … Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@45984654 has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:321) … How to implement checkpointing correctly
  • 21. © DataStax, All Rights Reserved. 21 Define DStream Set Checkpoint Path Create createStreamingContext() Checkpoint Recover Checkpoint Start Start How to implement checkpointing correctly Define DStream Set Checkpoint Path Create createStreamingContext() Checkpoint Recover Checkpoint Start Start Setup BSetup A Error starting the context… o.a.s.SparkException: o.a.s.s.kafka.DirectKafkaInputDStream@45984654 has not been initialized Hint: In red shows where the input stream is created, notice the different location between examples. NO YES NO YES
  • 22. © DataStax, All Rights Reserved. 22 Define DStream Set Checkpoint Path Create createStreamingContext() Checkpoint Recover Checkpoint Start Start How to implement checkpointing correctly Define DStream Set Checkpoint Path Create createStreamingContext() Checkpoint Recover Checkpoint Start Start Setup BSetup A Error starting the context… o.a.s.SparkException: o.a.s.s.kafka.DirectKafkaInputDStream@45984654 has not been initialized Hint: In red shows where the input stream is created, notice the different location between examples. NO YES NO YES We want the stream to be recoverable, so we need to move this into createStreamingContext(). The problem here is that we are creating the stream in the wrong place.
  • 23. © DataStax, All Rights Reserved. 23 Define DStream Set Checkpoint Path Create createStreamingContext() Checkpoint Recover Checkpoint Start Start How to implement checkpointing correctly Define DStream Set Checkpoint Path Create createStreamingContext() Checkpoint Recover Checkpoint Start Start Setup BSetup A Error starting the context… o.a.s.SparkException: o.a.s.s.kafka.DirectKafkaInputDStream@45984654 has not been initialized Hint: In red shows where the input stream is created, notice the different location between examples. NO YES NO YES We want just about everything in this function.
  • 24. © DataStax, All Rights Reserved. 24 How to implement checkpointing correctly Code Sample B def main(args: Array[String]) {
 
 val conf = new SparkConf() // with addition settings as needed
 val sc = SparkContext.getOrCreate(conf) val sqlContext = SQLContext.getOrCreate(sc)
 import sqlContext.implicits._ def createStreamingContext(): StreamingContext = {
 val newSsc = new StreamingContext(sc, Seconds(batchInterval))
 newSsc.checkpoint(checkpoint_path)
 newSsc
 }
 
 val ssc = StreamingContext.getActiveOrCreate(checkpoint_path, createStreamingContext)
 . . .
 val emailsStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
 
 . . . 
 ssc.start()
 ssc.awaitTermination()
 } Code Sample A Error starting the context… o.a.s.SparkException: o.a.s.s.kafka.DirectKafkaInputDStream@45984654 has not been initialized Hint: In red shows where the input stream is created, notice the different location between examples. : Highlights the method for creating the streaming context. https://github.com/datastax/code-samples/tree/master/ def main(args: Array[String]) {
 
 val conf = new SparkConf() // with addition settings as needed
 val sc = SparkContext.getOrCreate(conf) val sqlContext = SQLContext.getOrCreate(sc)
 import sqlContext.implicits._ def createStreamingContext(): StreamingContext = {
 val newSsc = new StreamingContext(sc, Seconds(batchInterval))
 newSsc.checkpoint(checkpoint_path) . . .
 val emailsStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) 
 . . . newSsc
 }
 
 val ssc = StreamingContext.getActiveOrCreate(checkpoint_path, createStreamingContext)
 
 ssc.start()
 ssc.awaitTermination()
 } Code Sample B
  • 25. © DataStax, All Rights Reserved. General tips to keep things running smoothly • One DSE node can run at most one DSEFS server. • In production, always set RF >= 3 for the DSEFS keyspace. • For optimal performance, organize local DSEFS data on different physical drives than the Cassandra database. • Currently DSEFS is intended for Spark streaming use cases, support for broader use cases is underway. 25 http://docs.datastax.com/en/latest-dse/datastax_enterprise/ana/aboutDsefs.html?hl=dsefs
  • 26. © DataStax, All Rights Reserved. DSE Filesystem supports all Spark fault-tolerant goals 1.DataStax Enterprise Filesystem (DSEFS) can support all our needs for Spark Streaming fault-tolerance (Metadata checkpointing, Write-ahead-logging, RDD checkpointing). 2.Setting up the DSEFS is very straight forward, and provides common utilities for managing data. 3.We covered some details to consider when build a Spark Streaming App. 4.We can use existing tools such as cfs-stress to benchmark I/O before deployment. 5.Many more features are currently underway for upcoming releases of DSEFS: • Enhanced security • Hardening for long-term storage: fsck improvements, checksumming • Performance improvements: Compression, data locality improvements and more • Access to other types of filesystems: HDFS, local file system from the DSEFS shell • User-experience improvements in the DSEFS shell: tab-completion, wildcards, etc. 26
  • 27. Thank You • Piotr Kołaczkowski (DSEFS Architect and Lead Developer) • Brian Hess (Product Management) • Jarosław Grabowski (DSEFS Developer) • Russell Spitzer (DSE Analytics Developer) • Ryan Knight (Solutions Engineer) • Giovanny Castro (DSE Test Engineer) • DSE Analytics Team https://github.com/datastax/code-samples/tree/master/