Version 1.0
Machine Learning with Spark + Cassandra
An Anant Corporation Story.
Using Spark and Cassandra for basic Machine Learning
What is Machine Learning?
● Machine learning is “the study of computer algorithms that improve automatically through
experience and by the use of data”
○ Seen as a part of artificial intelligence.
○ Used in a large number of applications in today’s society in many different fields.
■ e.g. medicine (drug discovery), image recognition, email filtering, ...
● Machine learning uses complicated mathematical concepts and systems (heavily focused on linear
algebra), making it appear intimidating to many beginners
○ However, many open-source, easy-to-access tools exist for working with machine learning in
the real world. Some examples include:
■ TensorFlow: https://www.tensorflow.org/
■ Spark Machine Learning Library (MLlib): http://spark.apache.org/docs/latest/ml-guide.html
○ Additionally, many well-constructed resources for learning machine learning exist online
■ Basic ML courses
■ A more advanced, theory-oriented look at ML, e.g. the Caltech CS 156 lectures:
https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
Machine Learning: Neat Examples
● Image Classification - getting the system to classify pictures as belonging to
some category
○ Popular example in many beginner machine learning courses - classifying each image as a picture of
a dog or a picture of a cat
■ Well-known competition on Kaggle + very large dataset to use if you are interested in trying:
https://www.kaggle.com/c/dogs-vs-cats
○ Another popular example - determining what digit is written in a picture (0-9).
■ Popular dataset: MNIST Dataset, includes 60,000 examples as part of a training set and 10,000
examples as part of a testing set. http://yann.lecun.com/exdb/mnist/
● Sentiment Analysis - attempting to determine whether a given text (e.g. a tweet from Twitter) is
generally “positive” or “negative”.
○ Well-known dataset called “sentiment140”, containing 1.6 million tweets labeled on a 0-4 scale (0 = negative,
4 = positive): https://www.kaggle.com/kazanova/sentiment140
Machine Learning: 4 stages of processing data
● Generally, there are four stages to a machine
learning task:
a. Preparation of data
■ Converting raw data into a form with
features/components separated out
and, where available, labels attached
b. Splitting of data
■ One portion of your data is used for
training your model, one for validating
that the training is “possibly okay”,
and one for the final test of your
model
c. Training your model
d. Obtaining predictions using the trained
model
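The four stages above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the demo's code: the toy "x,y,label" rows, the 60/20/20 split and the nearest-centroid "model" are all hypothetical choices made only to keep the example short.

```python
import random

def prepare(raw_rows):
    """Stage a: turn raw "x,y,label" strings into (features, label) pairs."""
    data = []
    for row in raw_rows:
        x, y, label = row.split(",")
        data.append(((float(x), float(y)), label.strip()))
    return data

def split(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Stage b: shuffle, then cut into training, validation and test sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def train(train_rows):
    """Stage c: 'train' a nearest-centroid model (the mean point of each label)."""
    sums = {}
    for (x, y), label in train_rows:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(model, features):
    """Stage d: predict the label whose centroid is closest to the features."""
    x, y = features
    return min(model, key=lambda lbl: (x - model[lbl][0]) ** 2 + (y - model[lbl][1]) ** 2)

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(model, feats) == label for feats, label in rows) / len(rows)

# Hypothetical toy data: two well-separated groups of 20 rows each.
raw = [f"{i % 5},{i % 3},small" for i in range(20)]
raw += [f"{100 + i % 5},{100 + i % 3},large" for i in range(20)]

data = prepare(raw)
train_rows, val_rows, test_rows = split(data)
model = train(train_rows)
val_accuracy = accuracy(model, val_rows)    # checked while tuning the model
test_accuracy = accuracy(model, test_rows)  # checked once, at the very end
```

The validation set is the one you look at while adjusting the model; the test set is held back so that the final score is measured on data the model has never influenced.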
Machine Learning: Supervised vs Unsupervised
● Supervised Learning: Training a machine learning model with a dataset that has labels attached to
it.
○ e.g. image classification of dogs vs cats, digits, and many other examples
● Unsupervised Learning: A machine learning model is given a dataset without explicit labels on the
data, thus without explicit instructions on what exactly to look for
○ Used to find unknown trends or groupings in data
○ Example of the kinds of things that unsupervised learning can look for: patterns in images
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
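To make "finding groupings without labels" concrete, here is a small k-means sketch in plain Python. K-means is used here only as a familiar example of unsupervised learning; the function names, the toy points and the choice of k = 2 are assumptions for illustration, and a real workload would use a library implementation instead.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )

def kmeans(points, k, iterations=20, seed=0):
    """Group unlabeled points into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # nothing moved, so the grouping has converged
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical toy data: no labels, but two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(points, k=2)
```

Nothing in the input says which group a point belongs to; the algorithm discovers the two groups from the distances alone, which is the sense in which the learning is unsupervised.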
Cassandra
● Apache Cassandra is an open-source distributed NoSQL
database designed to handle large volumes of data
across multiple servers
● Cassandra clusters can be scaled by either
improving hardware on current nodes (vertical
scalability) or adding more nodes (horizontal
scalability)
○ Horizontal scalability is part of why Cassandra is so
powerful - cheap machines can be added to a cluster to
significantly increase its capacity and throughput
● Note: The demo runs code from DataStax, which uses
DSE (DataStax Enterprise), DataStax's version of
Cassandra (not open source)
○ In DSE, Spark also runs together with Cassandra
Spark
● Apache Spark:
○ Open-Source unified analytics engine for large-scale
data processing
○ Provides an interface for programming entire clusters
with implicit data parallelism and fault tolerance
● Spark Machine Learning Library (MLlib)
○ Spark’s machine learning library
○ Features many common ML algorithms, tools for
modifying data and pipelines, loading/saving
algorithms, various utilities and more.
○ Primary API used to be RDD (Resilient Distributed
Datasets) based, but has switched over to being
DataFrame based since Spark 2.0
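As a rough sketch of what the DataFrame-based API looks like, the following builds a small MLlib pipeline that assembles numeric columns into a feature vector and clusters them with KMeans. The function names, the "features" and "cluster" column names, and the default k are assumptions for illustration and are not taken from the demo repo; the pyspark.ml classes themselves (Pipeline, VectorAssembler, KMeans) are part of Spark.

```python
def build_kmeans_pipeline(feature_cols, k=3, seed=1):
    """Return an unfitted DataFrame-based pipeline: assemble features, then cluster.

    Must be called while a SparkSession is active, because the pyspark.ml
    stages create their JVM counterparts when they are constructed.
    """
    # Imported lazily so this file can be loaded on a machine without PySpark.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # Combine the chosen numeric columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    # "cluster" is a hypothetical name for the output column holding the cluster id.
    kmeans = KMeans(k=k, seed=seed, featuresCol="features", predictionCol="cluster")
    return Pipeline(stages=[assembler, kmeans])

def cluster(df, feature_cols, k=3):
    """Fit the pipeline on a Spark DataFrame and return it with a 'cluster' column added."""
    model = build_kmeans_pipeline(feature_cols, k).fit(df)
    return model.transform(df)
```

With an active SparkSession and a DataFrame df that has numeric columns, a call such as cluster(df, ["col_a", "col_b"], k=3) (column names hypothetical) would return df with one extra column containing each row's cluster index.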
Jupyter
● Jupyter is an open-source web application for creating and
sharing documents containing code, equations, markdown
text and more.
○ Website for Project Jupyter: https://jupyter.org/
● Very popular amongst data scientists
● The GitHub repo demoed in this presentation
contains a few different Jupyter notebooks containing
sections of Python code along with explanations/narration
of the general idea of the code
● Jupyter notebooks can be used with programming languages
other than Python, including R, Julia, Scala and many others.
Demo Project Slide
● Link to GitHub repo: https://github.com/HadesArchitect/CaSpark
● Contains a Docker Compose file which will run three Docker images:
○ DSE - DataStax Enterprise version of Cassandra with Spark
○ Jupyter/pyspark-notebook: https://hub.docker.com/r/jupyter/pyspark-notebook
○ DataStax Studio
● The GitHub repo contains a lot of example files, but we will primarily be looking at KMeans.ipynb
● Note: Before running the docker-compose command, change the line dealing with
PYSPARK_SUBMIT_ARGS in the docker-compose.yml file
○ This is needed to use a more up-to-date version of the DataStax Spark Cassandra Connector
○ PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=dse pyspark-shell'
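The PYSPARK_SUBMIT_ARGS value above tells PySpark which connector package to download and which Cassandra host to contact. The sketch below rebuilds that same string from its parts and shows how a table could then be read through the connector's DataFrame source. The helper names, keyspace and table are hypothetical; the connector coordinates, the host name dse and the format name org.apache.spark.sql.cassandra are the only parts taken from the slide or from the connector itself.

```python
# Connector coordinates exactly as given in the slide.
CONNECTOR = "com.datastax.spark:spark-cassandra-connector_2.12:3.0.1"

def build_submit_args(connector=CONNECTOR, host="dse"):
    """Assemble the PYSPARK_SUBMIT_ARGS value used in docker-compose.yml."""
    return (
        f"--packages {connector} "
        f"--conf spark.cassandra.connection.host={host} "
        "pyspark-shell"
    )

def read_table(spark, keyspace, table):
    """Load a Cassandra table as a Spark DataFrame through the connector.

    spark is an existing SparkSession that was started with the connector
    package on its classpath (for example via PYSPARK_SUBMIT_ARGS).
    """
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )

submit_args = build_submit_args()
```

Outside Docker Compose, the same value could be placed in the PYSPARK_SUBMIT_ARGS environment variable before the SparkSession is created; setting it afterwards has no effect on an already running session.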
Resources
● https://www.tensorflow.org/
● http://spark.apache.org/docs/latest/ml-guide.html
● https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
● https://www.kaggle.com/c/dogs-vs-cats
● https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
● https://blog.anant.us/spark-and-cassandra-for-machine-learning-setup/
● https://github.com/HadesArchitect/CaSpark
● Previous Anant Webinar on YouTube: https://www.youtube.com/watch?v=ahqWq6Gkwbw
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra
