Cassandra is a distributed database that was designed for fault-tolerance, high availability, and scalability. It is an open source project that was initially created by Facebook to power their inbox search feature. It became an Apache project in 2009.
Cassandra was designed to support large amounts of data across any number of servers. In a Cassandra ring, data is replicated across multiple nodes so that there is no single point of failure, and data is distributed in such a way that each node contains a different set of data. Every node in a Cassandra ring has the same role: there is no concept of a master, and any node can respond to any request. Assuming sufficient redundancy, failed nodes can be replaced without downtime or interruption of service. New nodes can also be added as needed without interruption, and adding nodes can improve both read and write throughput.
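As a sketch of how replication is configured, Cassandra sets the replication factor per keyspace in CQL. The keyspace name and strategy below are illustrative; with a factor of 3, each row lives on three nodes of the ring:

```sql
-- Each row in this keyspace is stored on 3 nodes of the ring,
-- so any single node can fail without data becoming unavailable.
CREATE KEYSPACE inbox
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```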
Cassandra uses the Gossip protocol for inter-node communication. There is no central broadcaster, but information is eventually propagated to all nodes.
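To see why gossip works without a central broadcaster, here is a minimal toy simulation (not Cassandra's actual implementation): each round, every node that knows an update pushes it to one random peer, and the whole cluster converges in a small number of rounds.

```python
import random

def gossip_rounds(num_nodes, seed=42):
    """Simulate rumor spread: each round, every informed node pushes
    the update to one random peer. Returns the number of rounds until
    every node is informed."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the update
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        for _ in list(informed):
            informed.add(rng.randrange(num_nodes))
    return rounds

# Information reaches all nodes in roughly O(log N) rounds,
# with no coordinator involved.
print(gossip_rounds(100))
```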
The CAP theorem says that of Consistency, Availability, and Partition tolerance, a distributed system can only guarantee two. In general, Cassandra prioritizes availability and partition tolerance with an eventually consistent model. It does, however, support tunable consistency, so there are knobs and levers you can use to tune it.
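One of those knobs is the consistency level, which can be set per session or per query. In cqlsh, for example:

```sql
-- Require a majority of replicas to acknowledge each read/write,
-- trading some availability for stronger consistency.
CONSISTENCY QUORUM;

-- Or favor availability: a single replica's acknowledgment is enough.
CONSISTENCY ONE;
```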
The Cassandra Mesos framework automates bootstrapping and (some) operational tasks for Cassandra clusters. A minimum of one seed node must be running and healthy; the framework then adds nodes until it reaches the desired capacity. You can configure the number of nodes and the amount of resources, such as CPU and memory, that are needed. The framework automates some operational tasks, such as running ‘nodetool repair’, and it has the ability to restart, remove, and replace nodes on failure. Finally, the framework is self-contained: you do not have to pre-install or configure anything on your agent or slave nodes.
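For reference, the repair task the framework automates is the same anti-entropy repair an operator would otherwise run by hand on each node:

```shell
# Anti-entropy repair of this node's primary token range;
# the Cassandra Mesos framework schedules this across the ring for you.
nodetool repair -pr
```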
Spark is a data processing platform that is designed to be fast and general-purpose. It is often used as a replacement for the Hadoop MapReduce model, but it also supports other workloads like batch applications, interactive SQL queries, streaming data, and machine learning.
Spark was originally developed at UC Berkeley in 2009; it was actually built as a way to show how easy it is to build a framework on top of Mesos. One of Spark’s biggest benefits is its ability to run computations in memory as well as on disk. At a high level, Spark’s architecture is made up of three components: data storage, an API, and resource management. Spark supports creating distributed datasets from any Hadoop-compatible data source, like HDFS, HBase, S3, or Cassandra. Spark’s functionality is exposed through simple APIs in Python, Scala, Java, and R. Spark is designed to scale up as needed and supports a few different cluster managers, including a standalone Spark scheduler, YARN, and, of course, Mesos.
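To give a feel for the map/reduce model that Spark generalizes, here is the classic word count expressed in plain Python (this is not the Spark API itself; in PySpark the same shape would use `flatMap`, `map`, and `reduceByKey` on an RDD):

```python
lines = ["mesos makes clusters easy", "spark runs on mesos"]

# "map" phase: split lines into words, emit (word, 1) pairs
# (PySpark: lines.flatMap(str.split).map(lambda w: (w, 1)))
pairs = [(w, 1) for line in lines for w in line.split()]

# "reduce by key" phase: sum the counts per word
# (PySpark: pairs.reduceByKey(lambda a, b: a + b))
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["mesos"])  # 2: it appears in both lines
```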
Spark runs as a Mesos framework. One of the huge benefits of running Spark on Mesos is that it allows you to share your cluster with other applications and services: you do not have to build a dedicated cluster just for your Spark tasks, and multiple frameworks can coexist on the same cluster. The Spark framework supports the Docker containerizer, which means you can run Spark tasks from a Docker container. Otherwise, you would need to either package Spark and host it in HDFS or at an accessible HTTP or S3 URL, or pre-install Spark on all of your Mesos agents. Two-level scheduling in Mesos allows for sophisticated scheduling scenarios. When Mesos makes a resource offer to Spark, Spark is able to analyze it and decide whether or not to accept the resources. If Spark has no tasks to run, it will reject the offer; if it does have a task to run and the resource offer is sufficient, it can accept the offer and launch its task.
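As a sketch, submitting a job to Mesos with executors running in a Docker image looks roughly like this (host, port, image name, and script are placeholders):

```shell
# Point Spark at the Mesos master; executors on the agent nodes
# run inside the given Docker image, so nothing is pre-installed there.
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --conf spark.mesos.executor.docker.image=example/spark:latest \
  my_job.py
```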
The Spark framework supports two different run modes: coarse-grained and fine-grained. In coarse-grained mode, the framework launches long-running Spark executors on Mesos agent nodes. This allows new tasks to start very quickly, since there is already an executor running, which makes coarse-grained mode great for interactive Spark sessions. It can, however, be less efficient: resources are tied up by the long-running executors even when no tasks are running. In fine-grained mode, a new executor is launched per task. This obviously increases the time it takes to launch a new task, but it can be a more efficient use of resources, since the resources are only used while tasks are running. If no tasks are running, those resources are available for other frameworks to use.
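The two modes are switched with a single configuration property:

```shell
# Coarse-grained (the default): long-running executors, fast task startup.
spark-submit --conf spark.mesos.coarse=true my_job.py

# Fine-grained: one executor per task; resources are released
# between tasks for other frameworks to use.
spark-submit --conf spark.mesos.coarse=false my_job.py
```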
Docker is a container virtualization platform. It allows you to package an application, along with all of its dependencies, into an isolated container that you can run on any relatively modern Linux system. It achieves isolation by taking advantage of underlying Linux technologies like cgroups and namespaces.
Containers are often compared to virtual machines They are smaller, more lightweight, more portable, and easier to deploy. Docker can simplify the development lifecycle by allowing you to use the same artifact, the container, during development, in your continuous integration and testing pipeline, and on your production systems. By having your application and all of its dependencies packaged in a container, you have much less risk of running into compatibility problems as you move your application from development to testing to production
While Docker is not required in a Mesos environment, it can really simplify things, as your agent nodes won’t need a lot of specialized software installed. Let’s say you were going to deploy a Ruby app on your Mesos cluster. If you were not using the Docker containerizer, you would have to ensure that all of your agents had the right version of Ruby installed and configured for your app. If you had multiple applications that required different versions of Ruby... well, you probably see where this is going: it can get really painful, really quickly. By using Docker containers, you wouldn’t even need to install Ruby on your Mesos agents, as that dependency would be isolated in the container you deploy.
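A hypothetical Dockerfile for that Ruby app might look like this (the base image tag and entry script are illustrative); the Ruby version is pinned inside the image, so the Mesos agents need no Ruby at all:

```dockerfile
# The exact Ruby version the app needs ships inside the container.
FROM ruby:2.2
WORKDIR /app
# Install gems first so this layer is cached between code changes.
COPY Gemfile Gemfile.lock ./
RUN bundle install
COPY . .
CMD ["ruby", "app.rb"]
```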