Version 1.0
Apache Cassandra Lunch #54: Machine
Learning with Spark + Cassandra Part 2
An Anant Corporation Story.
What is Machine Learning?
● Machine learning is “the study of computer algorithms that improve automatically through
experience and by the use of data”
○ Seen as a part of artificial intelligence.
○ Used in a large number of applications in today’s society in many different fields.
■ E.g. medicine (drug discovery), image recognition, email filtering, ...
● Machine learning uses complicated mathematical concepts and systems (heavily focused on linear
algebra), making it appear intimidating to many beginners
○ However, many open-source and easy-to-access tools exist for working with machine learning in
the real world. Some examples include:
■ TensorFlow: https://www.tensorflow.org/
■ Spark Machine Learning Library (MLlib): http://spark.apache.org/docs/latest/ml-guide.html
○ Additionally, many well-constructed resources exist for learning machine learning online
■ Basic ML courses
■ More advanced, theory-oriented looks at ML, e.g. the Caltech CS 156 lectures:
https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
Machine Learning: 4 stages of processing data
● Generally, there are four stages to a machine
learning task:
a. Preparation of data
■ Converting raw data into data that
can be labeled and have its
features/components separated
b. Splitting of data
■ One portion of your data is used
to train the model, another to
validate that the training is
“possibly okay”, and a final portion
to test the trained model on (see
the split sketch after this list)
c. Training your model
d. Obtaining predictions using the trained
model
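As a minimal sketch of the splitting stage, assuming PySpark (which the demo uses) and a hypothetical already-prepared DataFrame, the data can be divided with randomSplit; the 70/15/15 ratios and column names are illustrative assumptions, not a rule from the demo:

from pyspark.sql import SparkSession

# Minimal sketch: split an assumed, already-prepared DataFrame into
# train/validation/test portions. The 70/15/15 ratios are illustrative.
spark = SparkSession.builder.appName("split-example").getOrCreate()
df = spark.createDataFrame(
    [(5.1, 3.5, 0), (6.7, 3.1, 1), (6.3, 3.3, 2)],  # tiny stand-in data
    ["feature_a", "feature_b", "label"],
)
# randomSplit normalizes the weights; the seed makes the split reproducible.
train_df, val_df, test_df = df.randomSplit([0.7, 0.15, 0.15], seed=42)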
Machine Learning - why Cassandra?
● Speed/Reliability:
○ Machine learning projects tend to involve processing very large amounts of data in order to properly train
complicated models (millions of data points are necessary for high accuracy in some applications)
■ Cassandra is very good at dealing with large amounts of incoming data.
○ Additionally, Cassandra is built on a masterless architecture - there are no “master” nodes, and thus
there is no single point of failure.
■ Data in Cassandra is replicated, with multiple copies stored across multiple nodes, ensuring the
availability of data even when nodes drop from the cluster or unexpectedly shut down.
● Scalability/Upgradability:
○ Cassandra can be scaled both horizontally and vertically:
■ Horizontal scalability allows the addition of many weaker machines instead of upgrading to more
powerful hardware on a few machines, making it easy to scale up performance at relatively
little cost
ML - Random Forest
● The building block of a Random Forest is a concept known as a Decision Tree:
○ Decision Tree Learning is a predictive modelling approach used in statistics, data mining and machine
learning.
○ Given a set of data and labels for that data, we can build a decision tree that learns how to separate
the data into classes by testing on various attributes of the data.
■ Each node represents an attribute we test on
ML - Random Forest cont.
● A random forest, as the name suggests, consists of a large group of decision trees built from
different segments of the data.
● Suppose we have N pieces of data with M features. We choose to build n decision trees
as part of the random forest.
○ For each decision tree, we take a random sample of size N from the data, but with replacement
(intentionally allowing some duplicate pieces of data)
■ This is bootstrap sampling (bagging): it keeps each tree’s training set the same size while
introducing variation between the trees
○ Each decision tree also randomly selects only m < M features of the data to build the tree on.
■ This is known as Feature Randomness. It forces more variation amongst the trees, thus ultimately
lowering correlation across trees and producing more diversification.
○ A given piece of data is then run through/classified by all of the decision trees, and whichever class is
predicted most often (a majority vote) becomes the Random Forest model’s prediction.
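As a hedged sketch (not the demo notebook’s exact code), this is roughly how a Random Forest classifier is trained with Spark MLlib; the column names, numTrees value, and 80/20 split are illustrative assumptions, and the label column is assumed to already be a numeric index:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Assemble the assumed feature columns into the single vector column
# that MLlib estimators expect.
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b", "feature_c"], outputCol="features"
)
prepared_df = assembler.transform(df)  # df: a labeled DataFrame loaded earlier
train_df, test_df = prepared_df.randomSplit([0.8, 0.2], seed=42)

# numTrees corresponds to n above; featureSubsetStrategy controls the
# m < M feature randomness each tree uses.
rf = RandomForestClassifier(
    labelCol="label", featuresCol="features",
    numTrees=20, featureSubsetStrategy="sqrt",
)
model = rf.fit(train_df)
predictions = model.transform(test_df)  # majority vote across the trees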
ML - Random Forest cont.
● The following diagram visually shows the process a single piece of data goes through in order to be
classified by a Random Forest classifier:
ML - Naive Bayes classifier
● The Naive Bayes classifier is a family of simple probability-based classification algorithms
based on Bayes’ Theorem:
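(The slide shows the formula as an image; for reference, Bayes’ Theorem in the usual notation is:)

P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\, P(C_k)}{P(\mathbf{x})}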
● In a classification problem, however, we are trying to obtain the probability that some object
is a member of a particular class given a large number of known feature values. If the input
features can each take many possible values, then applying Bayes’ theorem directly via
probability tables becomes impractical for complicated problems.
ML - Naive Bayes classifier cont...
● To resolve this issue, we make the “naive” assumption of conditional independence: assume that
all features in x are mutually independent, conditional on the category Ck:
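(In symbols, the slide’s image states that each feature depends only on the class:)

P(x_i \mid x_{i+1}, \dots, x_n, C_k) = P(x_i \mid C_k)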
● After going through some math, we come to the following conclusion given this set of assumptions:
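(The conclusion shown as an image on the slide is the standard Naive Bayes posterior, with the prediction taken as the most probable class:)

P(C_k \mid x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k),
\qquad \hat{y} = \underset{k}{\arg\max}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)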
Cassandra
● Apache Cassandra is an open-source distributed
NoSQL database designed to handle large volumes of
data across many servers
● Cassandra clusters can be upgraded by either
improving hardware on current nodes (vertical
scalability) or adding more nodes (horizontal
scalability)
○ Horizontal scalability is part of why Cassandra is so
powerful - cheap machines can be added to a cluster to
significantly improve its performance
● Note: The demo runs code from DataStax, which uses
the DSE (DataStax Enterprise) version of Cassandra
(not open source)
○ Spark is also run together with Cassandra in DSE
Spark
● Apache Spark:
○ Open-source unified analytics engine for large-scale
data processing
○ Provides an interface for programming entire clusters
with implicit data parallelism and fault tolerance
● Spark Machine Learning Library (MLlib)
○ Spark’s machine learning library
○ Features many common ML algorithms, tools for
transforming data and building pipelines, utilities for
saving/loading algorithms and models, and more.
○ The primary API used to be RDD-based (Resilient
Distributed Datasets), but it has been DataFrame-based
since Spark 2.0
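As a small, hedged illustration of the DataFrame-based MLlib API (not code from the demo), a Naive Bayes model can be wired into a Pipeline like this; all column names are assumptions for illustration:

from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Illustrative pipeline: index a string label column, assemble the assumed
# feature columns, then fit a multinomial Naive Bayes model
# (multinomial NB expects non-negative feature values).
indexer = StringIndexer(inputCol="species", outputCol="label")
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
nb = NaiveBayes(labelCol="label", featuresCol="features", modelType="multinomial")

pipeline = Pipeline(stages=[indexer, assembler, nb])
model = pipeline.fit(train_df)          # train_df: an assumed training DataFrame
predictions = model.transform(test_df)  # test_df: an assumed held-out DataFrame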
Jupyter
● Jupyter is an open-source web application for creating and
sharing documents containing code, equations, markdown
text and more.
○ Website for Project Jupyter: https://jupyter.org/
● Very popular amongst data scientists
● The GitHub repo demoed in this presentation contains
several Jupyter notebooks with sections of Python code,
along with explanations/narration of the general idea
of the code
● Jupyter notebooks can be used with programming languages
other than Python, including R, Julia, Scala and many others.
Demo Project Slide
● Link to Github Repo: https://github.com/HadesArchitect/CaSpark
● Contains a Docker compose file which will run three docker images:
○ DSE - DataStax Enterprise version of Cassandra with Spark
○ Jupyter/pyspark-notebook: https://hub.docker.com/r/jupyter/pyspark-notebook
○ DataStax Studio
● Github Repo contains a lot of example files, but we will be primarily looking at two examples
whose theory we discussed in this presentation:
○ Random Forest classification (Random Forest.ipynb)
○ Naive Bayes classification (Naivebayes.ipynb)
● Note: One change needs to be made to the docker-compose.yml file before running the docker-compose
command, updating the line that sets PYSPARK_SUBMIT_ARGS
○ This pulls in a more up-to-date version of the DataStax Spark Cassandra Connector
○ PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf
spark.cassandra.connection.host=dse pyspark-shell'
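Once the notebook’s Spark session is started with the PYSPARK_SUBMIT_ARGS above, a Cassandra table can be read into a DataFrame through the connector; this is a hedged sketch, and the keyspace/table names below are illustrative rather than taken from the repo:

from pyspark.sql import SparkSession

# The connector package and the "dse" host come from PYSPARK_SUBMIT_ARGS above;
# the keyspace and table names are illustrative assumptions.
spark = SparkSession.builder.appName("cassandra-read").getOrCreate()
df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="demo", table="example_table")
    .load()
)
df.show(5)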
Resources
● https://www.analyticsvidhya.com/blog/2020/12/lets-open-the-black-box-of-random-forests/
● https://blog.anant.us/apache-cassandra-lunch-50-machine-learning-with-spark-cassandra/
● https://towardsdatascience.com/understanding-random-forest-58381e0602d2
● https://stats.stackexchange.com/questions/36165/does-the-optimal-number-of-trees-in-a-random-forest-depend-on-the-number-of-pred/36183
● https://pandio.com/blog/five-ways-apache-cassandra-is-designed-to-support-machine-learning-use-cases/
● https://www.quora.com/Why-does-random-forest-use-sampling-with-replacement-instead-of-without-replacement
● https://en.wikipedia.org/wiki/Naive_Bayes_classifier
● https://www.geeksforgeeks.org/naive-bayes-classifiers/
● http://users.sussex.ac.uk/~christ/crs/ml/lec02b.html
● https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
