Cassandra is a distributed database that was designed for fault-tolerance, high availability, and scalability.
It is an open source project that was initially created by Facebook to power their inbox search feature.
It became an Apache project in 2009.
Cassandra was designed to support large amounts of data across any number of servers.
In a Cassandra ring, data is replicated across multiple nodes so that there is no single point of failure.
The data is also partitioned, so each node holds a different subset of the overall dataset.
Every node in a Cassandra ring has the same role.
There is no concept of a master, and any node can respond to any request.
Assuming sufficient redundancy, failed nodes can be replaced without downtime or interruption of service.
New nodes can also be added as needed without interruption.
Adding nodes can improve both read and write throughput.
Cassandra uses the Gossip protocol for inter-node communication.
There is no central broadcaster, but information eventually propagates to all nodes.
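To illustrate how gossip-style dissemination converges, here is a small Python sketch (not Cassandra's actual implementation): each round, every node that knows a piece of state passes it to a few random peers, and everyone learns it without any central broadcaster.

```python
import random

def gossip_rounds(num_nodes=20, fanout=3, seed=42):
    """Simulate rumor spread: each round, every informed node tells
    `fanout` random peers. Returns the rounds until all nodes know."""
    random.seed(seed)
    informed = {0}  # node 0 learns a new piece of cluster state
    rounds = 0
    while len(informed) < num_nodes:
        newly = set()
        for node in informed:
            newly.update(random.sample(range(num_nodes), fanout))
        informed |= newly
        rounds += 1
    return rounds

# The rumor reaches every node in a handful of rounds (roughly
# logarithmic in cluster size), with no coordinator involved.
print(gossip_rounds())
```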
The CAP theorem says that of Consistency, Availability, and Partition Tolerance, a distributed system can only guarantee two.
In general, Cassandra prioritizes Availability and Partition Tolerance with an eventually consistent model.
It does, however, support tunable consistency, so there are knobs and levers you can use to adjust that trade-off.
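One way to reason about those knobs: with a replication factor of N, you choose how many replicas must acknowledge a write (W) and answer a read (R) per operation. A small sketch of the overlap rule (the helper name here is made up for illustration):

```python
def is_strongly_consistent(n, r, w):
    """With replication factor n, a read touching r replicas and a
    write acknowledged by w replicas must overlap in at least one
    replica whenever r + w > n -- so reads see the latest write."""
    return r + w > n

# QUORUM reads plus QUORUM writes with replication factor 3:
quorum = 3 // 2 + 1                               # 2
print(is_strongly_consistent(3, quorum, quorum))  # True: 2 + 2 > 3

# Consistency level ONE trades consistency for speed and availability:
print(is_strongly_consistent(3, 1, 1))  # False: reads may be stale
```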
The Cassandra Mesos framework automates bootstrapping and (some) operational tasks for Cassandra clusters
The framework first ensures that a minimum of one seed node is running and healthy.
It then adds nodes until the cluster reaches the desired capacity.
You can configure the number of nodes and the amount of resources, such as CPU and memory, that are needed
The framework automates some operational tasks such as running ‘nodetool repair’
The framework has the ability to restart, remove, and replace nodes on failure.
Finally, the framework is self-contained:
you do not have to pre-install or configure anything on your Mesos agent nodes.
Spark is a data processing platform that is designed to be fast and general-purpose
It is often used as a replacement for the Hadoop MapReduce model, but it also supports other workloads like
batch applications
interactive SQL queries
streaming data and
machine learning
Spark was originally developed at UC Berkeley in 2009.
It was actually built as a way to show how easy it is to build a framework on top of Mesos.
One of Spark’s biggest benefits is its ability to run computations in memory as well as on disk.
At a high level, Spark’s architecture is made up of 3 components:
Data storage
API
Resource management
Spark supports creating distributed datasets from any Hadoop-compatible data source like
HDFS
HBase
S3
Cassandra
Spark’s functionality is exposed in simple APIs in Python, Scala, Java, and R
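To give a feel for that API style, here is the classic word count written in plain Python in the same map/reduce shape; in real PySpark the steps would correspond roughly to textFile, flatMap, map, and reduceByKey (this is a sketch of the model, not PySpark itself).

```python
from collections import Counter

lines = ["to be or not to be"]

words = (w for line in lines for w in line.split())  # flatMap
pairs = ((w, 1) for w in words)                      # map
counts = Counter()
for word, n in pairs:                                # reduceByKey
    counts[word] += n

print(counts["to"])  # 2
```

In Spark, the same three transformations run in parallel across the cluster, and the dataset can be far larger than any one machine's memory.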
Spark is designed to scale up as needed and supports a few different cluster managers including
a standalone Spark scheduler
Yarn
and, of course, Mesos
Spark runs as a Mesos framework
One of the huge benefits of running Spark on Mesos is that it allows you to share your cluster with other applications and services.
You do not have to build a dedicated cluster just for your Spark tasks
Multiple frameworks can coexist on the same cluster
The Spark framework supports the Docker containerizer.
This means you can run Spark tasks from a Docker container.
Otherwise, you would need to either package Spark and host it in HDFS, or at an accessible HTTP or S3 URL, or
you would need to pre-install Spark on all of your Mesos agents.
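As a sketch of what that looks like, a job submission might point the executors at a Docker image; the master address, image name, and script below are placeholders, not defaults.

```shell
# Sketch: run Spark executors inside a Docker image on Mesos.
spark-submit \
  --master mesos://zk://zk1:2181/mesos \
  --conf spark.mesos.executor.docker.image=mesosphere/spark:latest \
  my_job.py
```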
Two-level scheduling in Mesos allows for sophisticated scheduling scenarios.
When Mesos makes a resource offer to Spark, Spark is able to analyze it and decide whether or not to accept the resources
If Spark has no tasks to run, it will reject the offer
If it does have a task to run and the resource offer is sufficient, it can accept the offer and launch its task
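That accept-or-decline decision can be sketched in a few lines of Python; all of the names here are hypothetical and simplified, not the actual Mesos scheduler API.

```python
def decide(offer_cpus, offer_mem, pending_tasks):
    """Given a resource offer, return the task to launch,
    or None to decline the offer."""
    for task in pending_tasks:
        if task["cpus"] <= offer_cpus and task["mem"] <= offer_mem:
            return task  # accept the offer and launch this task
    return None          # decline: no pending task fits this offer

tasks = [{"name": "etl", "cpus": 4, "mem": 8192}]
print(decide(2, 4096, tasks))   # None: offer too small, declined
print(decide(8, 16384, tasks))  # the "etl" task is launched
```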
The Spark framework supports two different run modes
Coarse-grained
Fine-grained
In Coarse-grained mode, the framework launches long-running Spark executors on Mesos agent nodes
This allows new tasks to start very quickly since there is already an executor running
Coarse-grained mode is great for interactive Spark sessions
Coarse-grained mode can be less efficient, though:
resources are tied up by the long-running executors even when no tasks are running.
In fine-grained mode, a new executor is launched per task
This increases the time it takes to launch a new task.
However, it can be a more efficient use of resources:
the resources are only consumed while tasks are running.
If no tasks are running, those resources are available for other frameworks to use
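Choosing between the two modes comes down to a single Spark configuration property; a sketch of a spark-defaults.conf entry (note that which value is the default depends on your Spark version):

```properties
# In spark-defaults.conf -- pick one:
spark.mesos.coarse  true    # coarse-grained: long-running executors
spark.mesos.coarse  false   # fine-grained: one executor per task
```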
Docker is a container virtualization platform
It allows you to package an application, along with all of its dependencies, into an isolated container that you can run on any relatively modern Linux system.
It achieves isolation by taking advantage of underlying Linux technologies like cgroups and namespaces.
Containers are often compared to virtual machines:
they are smaller, more lightweight, more portable, and easier to deploy.
Docker can simplify the development lifecycle by allowing you to use the same artifact, the container, during development, in your continuous integration and testing pipeline, and on your production systems.
By having your application and all of its dependencies packaged in a container, you have much less risk of running into compatibility problems as you move your application from development to testing to production
While Docker is not required in a Mesos environment, it can really simplify things as your agent nodes won’t need a lot of specialized software installed.
Let’s say you were going to deploy a Ruby app on your Mesos cluster.
If you were not using the Docker containerizer, you would have to ensure that all of your agents had the right version of Ruby installed and configured for your app.
If you had multiple applications that required different versions of Ruby... well, you probably see where this is going. It can get really painful, really quickly.
By using Docker containers, you wouldn’t even need to install Ruby on your Mesos agents, as that dependency would be isolated in the container you deploy.
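As a sketch, a minimal Dockerfile for such an app might look like this; the Ruby version, file names, and entry point are placeholders for illustration.

```dockerfile
# The Ruby version lives in the image, not on the agent nodes
FROM ruby:2.3
WORKDIR /app

# Bake the app's gem dependencies into the container
COPY Gemfile Gemfile.lock ./
RUN bundle install

COPY . .
CMD ["ruby", "app.rb"]
```

Every agent that runs this container gets exactly the same Ruby and gems, and two apps needing different Ruby versions simply ship different images.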