Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra

Version 1.0
Machine Learning with Spark + Cassandra
An Anant Corporation Story.
Using Spark and Cassandra for basic Machine Learning

What is Machine Learning?
● Machine learning is “the study of computer algorithms that improve automatically through
experience and by the use of data”
○ Seen as a part of artificial intelligence.
○ Used in a large number of applications in today’s society in many different fields.
■ E.g. medicine (drug discovery), image recognition, email filtering, ...
● Machine learning uses complicated mathematical concepts and systems (heavily focused on linear
algebra), making it appear intimidating to many beginners
○ However, many open-source, easy-to-access tools exist for working with machine learning in
the real world. Some examples include:
■ TensorFlow: https://www.tensorflow.org/
■ Spark Machine Learning Library (MLlib): http://spark.apache.org/docs/latest/ml-guide.html
○ Additionally, many well-constructed resources exist for learning machine learning online
■ Basic ML courses
■ More advanced, theory-oriented look at ML, e.g. the Caltech CS 156 lectures:
https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A

Machine Learning: Neat Examples
● Image Classification - attempting to get the system to be able to classify pictures as being part of
some category
○ Popular example in many beginner machine learning courses - classifying images as either a picture of
a dog or a picture of a cat
■ Well-known competition on Kaggle + very large dataset to use if you are interested in trying:
https://www.kaggle.com/c/dogs-vs-cats
○ Another popular example - determining what digit is written in a picture (0-9).
■ Popular dataset: MNIST Dataset, includes 60,000 examples as part of a training set and 10,000
examples as part of a testing set. http://yann.lecun.com/exdb/mnist/
● Sentiment Analysis - attempting to determine whether a given text (e.g. a tweet from Twitter) is
generally “positive” or “negative”.
○ Well-known dataset called “sentiment140”, containing 1.6 million tweets rated on a scale from 0 to 4 (negative to
positive): https://www.kaggle.com/kazanova/sentiment140

Machine Learning: 4 stages of processing data
● Generally, there are four stages to a machine learning task:
a. Preparation of data
■ Converting raw data into data that may be labeled and have features/components separated
b. Splitting of data
■ A portion of your data is used for training your model, a portion for validating that the
training is “possibly okay”, and a portion for testing your model
c. Training your model
d. Obtaining predictions using the trained model
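The four stages can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the raw rows are made-up data, and the "model" is a simple nearest-centroid classifier chosen for brevity, not the method used in the demo. All function names here are hypothetical.

```python
import random


def prepare(raw_rows):
    """Stage a: turn raw text rows into (features, label) pairs."""
    prepared = []
    for row in raw_rows:
        *values, label = row.strip().split(",")
        prepared.append(([float(v) for v in values], label))
    return prepared


def split(rows, train=0.6, validation=0.2, seed=0):
    """Stage b: shuffle, then cut into training, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def train_centroids(rows):
    """Stage c: 'train' by averaging the feature vectors of each label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def predict(model, features):
    """Stage d: return the label whose centroid is closest to the features."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: sq_dist(model[label]))


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(model, f) == label for f, label in rows) / len(rows)


# Made-up, well-separated raw data: "feature1,feature2,label"
raw = ([f"{x},{x + 1},small" for x in range(10)]
       + [f"{x},{x + 1},large" for x in range(100, 110)])

train_set, validation_set, test_set = split(prepare(raw))
model = train_centroids(train_set)
print("validation accuracy:", accuracy(model, validation_set))
print("test accuracy:", accuracy(model, test_set))
```

The validation set is checked before the test set so that the test data stays untouched until the model is considered finished.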

Machine Learning: Supervised vs Unsupervised
● Supervised Learning: Training a machine learning model with a dataset that has labels attached to
it.
○ E.g. image classification of dogs vs cats, digits, and many other examples
● Unsupervised Learning: A machine learning model is given a dataset without explicit labels on the
data, thus without explicit instructions on what exactly to look for
○ Used to find unknown trends or groupings in data
○ Example of the kinds of things that unsupervised learning can look for: patterns in images
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
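As an illustration of unsupervised learning, here is a minimal k-means sketch in plain Python. It groups points without ever seeing a label. The points are made-up data, and starting from the first k points is a simplification assumed here for determinism; library implementations such as Spark MLlib use more careful initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # Keep an empty cluster's centroid where it was.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


def kmeans(points, k, max_iterations=20):
    """Alternate assignment and update steps until labels stop changing."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic init (assumption)
    labels = assign(points, centroids)
    for _ in range(max_iterations):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Made-up unlabeled data: two visibly separate groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

No labels are supplied anywhere; the grouping comes only from the distances between points.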

Cassandra
● Apache Cassandra is an open-source distributed NoSQL
database designed to handle large volumes of data
across multiple different servers
● Cassandra clusters can be upgraded by either
improving hardware on current nodes (vertical
scalability) or adding more nodes (horizontal
scalability)
○ Horizontal scalability is part of why Cassandra is so
powerful - cheap machines can be added to a cluster to
significantly improve its performance
● Note: The demo runs code from DataStax, which uses the
DSE (DataStax Enterprise) version of Cassandra (not
open source)
○ Spark is also run together with Cassandra in DSE

Spark
● Apache Spark:
○ Open-source unified analytics engine for large-scale
data processing
○ Provides an interface for programming entire clusters
with implicit data parallelism and fault tolerance
● Spark Machine Learning Library (MLlib)
○ Spark’s machine learning library
○ Features many common ML algorithms, tools for
modifying data and pipelines, loading/saving
algorithms, various utilities and more.
○ Primary API used to be RDD (Resilient Distributed
Datasets) based, but has switched over to being
DataFrame based since Spark 2.0
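A minimal sketch of the DataFrame-based MLlib API, assuming PySpark is installed and `df` is a Spark DataFrame with numeric columns. The function name `fit_kmeans`, the output column name `cluster`, and the defaults are illustrative assumptions, not taken from the demo notebooks.

```python
def fit_kmeans(df, feature_cols, k=3, seed=1):
    """Cluster the rows of a Spark DataFrame with the DataFrame-based API.

    Returns the fitted pipeline model and the DataFrame with an added
    "cluster" column.
    """
    # Imported lazily so this sketch can be loaded without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # MLlib estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    kmeans = KMeans(k=k, seed=seed,
                    featuresCol="features", predictionCol="cluster")

    model = Pipeline(stages=[assembler, kmeans]).fit(df)
    return model, model.transform(df)


# Usage (requires a running SparkSession and a DataFrame `df`;
# the column names below are hypothetical):
# model, clustered = fit_kmeans(df, ["sepal_length", "sepal_width"], k=3)
# clustered.select("cluster").show()
```

Wrapping the assembler and the estimator in a `Pipeline` keeps data preparation and training together, so the same steps are applied when the model is later used on new data.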

Jupyter
● Jupyter is an open-source web application for creating and
sharing documents containing code, equations, markdown
text and more.
○ Website for Project Jupyter: https://jupyter.org/
● Very popular amongst data scientists
● The GitHub repo demoed in this presentation
contains a few different Jupyter notebooks containing
sections of Python code along with explanations/narration
of the general idea of the code
● Jupyter notebooks can be used with programming languages
other than Python, including R, Julia, Scala and many others.

Demo Project Slide
● Link to GitHub repo: https://github.com/HadesArchitect/CaSpark
● Contains a Docker Compose file which will run three Docker images:
○ DSE - DataStax Enterprise version of Cassandra with Spark
○ Jupyter/pyspark-notebook: https://hub.docker.com/r/jupyter/pyspark-notebook
○ DataStax Studio
● The GitHub repo contains a lot of example files, but we will be primarily looking at KMeans.ipynb
● Note: Before running the docker-compose command, change the PYSPARK_SUBMIT_ARGS line in the
docker-compose.yml file
○ A more up-to-date version of the DataStax Spark Cassandra Connector is needed
○ PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=dse pyspark-shell'
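With the connector package loaded through PYSPARK_SUBMIT_ARGS, a notebook can read a Cassandra table into a Spark DataFrame. The following is a minimal sketch, assuming an existing SparkSession is passed in as `spark`. The helper names and the keyspace/table names in the usage comment are hypothetical and not taken from the repo; the `org.apache.spark.sql.cassandra` format and the `keyspace`/`table` options are those of the Spark Cassandra Connector.

```python
# Data source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def cassandra_options(keyspace, table):
    """Build the options the connector needs to locate a table."""
    return {"keyspace": keyspace, "table": table}


def read_cassandra_table(spark, keyspace, table):
    """Load a Cassandra table as a Spark DataFrame."""
    return (spark.read
            .format(CASSANDRA_FORMAT)
            .options(**cassandra_options(keyspace, table))
            .load())


# Usage inside the notebook (requires the running containers;
# "demo" and "iris" are hypothetical names):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("demo").getOrCreate()
# df = read_cassandra_table(spark, "demo", "iris")
# df.show(5)
```

The connection host does not appear in the code because it is already set by `spark.cassandra.connection.host=dse` in the submit arguments.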

Resources
● https://www.tensorflow.org/
● http://spark.apache.org/docs/latest/ml-guide.html
● https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
● https://www.kaggle.com/c/dogs-vs-cats
● https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
● https://blog.anant.us/spark-and-cassandra-for-machine-learning-setup/
● https://github.com/HadesArchitect/CaSpark
● Previous Anant webinar on YouTube: https://www.youtube.com/watch?v=ahqWq6Gkwbw

Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
