Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2

In Apache Cassandra Lunch #54, we will discuss how you can use Apache Spark and Apache Cassandra to perform additional basic Machine Learning tasks.

Accompanying Blog: https://blog.anant.us/apache-cassandra-lunch-54-machine-learning-with-spark--cassandra-part-2/

Accompanying YouTube Video: https://youtu.be/3roCSBWQzRk

Sign Up For Our Newsletter: http://eepurl.com/grdMkn

Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/

Cassandra.Link:
https://cassandra.link/

Follow Us and Reach Us At:

Anant:
https://www.anant.us/

Awesome Cassandra:
https://github.com/Anant/awesome-cassandra

Cassandra.Lunch:
https://github.com/Anant/Cassandra.Lunch

Email:
solutions@anant.us

LinkedIn:
https://www.linkedin.com/company/anant/

Twitter:
https://twitter.com/anantcorp

Eventbrite:
https://www.eventbrite.com/o/anant-1072927283

Facebook:
https://www.facebook.com/AnantCorp/

  1. Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2 (Version 1.0). An Anant Corporation Story.
  2. What is Machine Learning?
     ● Machine learning is “the study of computer algorithms that improve automatically through experience and by the use of data”.
       ○ Seen as a part of artificial intelligence.
       ○ Used in a large number of applications across many different fields.
         ■ E.g. medicine (drug discovery), image recognition, email filtering, ...
     ● Machine learning relies on complicated mathematical concepts and systems (with a heavy focus on linear algebra), which makes it appear intimidating to many beginners.
       ○ However, many open-source, easy-to-access tools exist for working with machine learning in the real world. Some examples:
         ■ TensorFlow: https://www.tensorflow.org/
         ■ Spark Machine Learning Library (MLlib): http://spark.apache.org/docs/latest/ml-guide.html
       ○ Additionally, many well-constructed resources exist for learning machine learning online:
         ■ Basic ML courses
         ■ More advanced, theory-oriented treatments of ML, e.g. the Caltech CS 156 lectures: https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
  3. Machine Learning: 4 stages of processing data
     ● Generally, there are four stages to a machine learning task:
       a. Preparation of data
          ■ Converting raw data into data that can be labeled and has its features/components separated
       b. Splitting of data
          ■ One portion of the data is used for training the model, one for validating that the training is “possibly okay”, and one for testing the model
       c. Training the model
       d. Obtaining predictions using the trained model
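
The four stages can be walked through end to end on a toy data set. This is a minimal sketch in plain Python, not the demo's Spark code: the `RAW` records, the 60/20/20 split, and the nearest-centroid "model" are all assumptions chosen only to keep the example small.

```python
import random

# Hypothetical raw records in the form "height_cm,weight_kg,label".
RAW = [
    "150,50,small", "152,53,small", "155,55,small", "158,57,small", "160,60,small",
    "180,85,large", "183,88,large", "185,90,large", "188,94,large", "190,95,large",
]

def prepare(rows):
    """Stage a: turn raw strings into (features, label) pairs."""
    data = []
    for row in rows:
        *values, label = row.split(",")
        data.append(([float(v) for v in values], label))
    return data

def split(data, train=0.6, validation=0.2, seed=42):
    """Stage b: shuffle, then cut into training, validation and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def train_centroids(train_set):
    """Stage c: 'train' by averaging the feature vectors of each class."""
    sums, counts = {}, {}
    for features, label in train_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Stage d: pick the class whose centroid is closest to the input."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

data = prepare(RAW)
train_set, validation_set, test_set = split(data)
model = train_centroids(train_set)
validation_accuracy = sum(
    predict(model, features) == label for features, label in validation_set
) / len(validation_set)
```

The validation set is what you would look at while tuning; the test set stays untouched until the end so that it gives an honest estimate of how the model behaves on unseen data.
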
  4. Machine Learning - why Cassandra?
     ● Speed/Reliability:
       ○ Machine learning projects tend to involve processing very large amounts of data in order to properly train complicated models (millions of data points are necessary for high accuracy in some applications).
         ■ Cassandra is very good at dealing with large amounts of incoming data.
       ○ Additionally, Cassandra is built on a masterless architecture - there are no “master” nodes, and thus there is no single point of failure.
         ■ Data in Cassandra is replicated, with copies stored on more than one node, ensuring the availability of data even when nodes drop from the cluster or unexpectedly shut down.
     ● Scalability/Upgradability:
       ○ Cassandra can be scaled both horizontally and vertically:
         ■ Horizontal scalability means adding many weaker machines instead of upgrading a few machines to more powerful hardware, so performance can be scaled up at relatively little cost.
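
A toy sketch of the replication idea above, assuming a simplified ring in which each node owns a single token and a key's replicas are the next `replication_factor` nodes clockwise. `ToyRing`, the node names, and the use of MD5 for tokens are illustrative assumptions; real Cassandra clusters use a configurable partitioner and replication strategy, with more moving parts than shown here.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (MD5 here, purely for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class ToyRing:
    """Hypothetical masterless ring: every node is a peer, none is special."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self.ring = sorted((token(node), node) for node in nodes)

    def replicas(self, partition_key):
        """Walk clockwise from the key's token and collect distinct nodes."""
        tokens = [t for t, _ in self.ring]
        start = bisect.bisect_right(tokens, token(partition_key))
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

ring = ToyRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("sensor-42")

# Simulate one replica going down: the other copies are still reachable.
down = owners[0]
survivors = [node for node in owners if node != down]
```

Because every key lives on several peers, losing one node leaves the remaining replicas able to serve the data, which is the "no single point of failure" property the slide describes.
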
  5. ML - Random Forest
     ● The building block of a Random Forest is a concept known as a Decision Tree:
       ○ Decision tree learning is a predictive modelling approach used in statistics, data mining and machine learning.
       ○ Given a set of data and labels for that data, we can build a decision tree that separates the data into classes by testing on various attributes of the data.
         ■ Each node represents an attribute we test on
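
To make "testing on an attribute" concrete, here is a small sketch of how a single tree node might pick its test. It assumes numeric features and uses Gini impurity as the split criterion; the slide does not name a criterion, so that choice, along with the `gini` and `best_split` helper names and the sample data, is an assumption.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for more mixed groups."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send data both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

# Hypothetical data: feature 0 separates the classes cleanly, feature 1 does not.
rows = [[2.0, 7.0], [3.0, 1.0], [4.0, 8.0], [10.0, 2.0], [11.0, 9.0], [12.0, 3.0]]
labels = ["a", "a", "a", "b", "b", "b"]
feature, threshold, score = best_split(rows, labels)
```

A full tree would apply `best_split` recursively to the left and right groups until each group is pure or a depth limit is reached.
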
  6. ML - Random Forest cont.
     ● A random forest, as the name suggests, consists of a large group of decision trees built from different samples of the data.
     ● Suppose we have N pieces of data with M features. We choose to build n decision trees as part of the random forest.
       ○ For the input to each decision tree, we take a random sample of size N from the data, but with replacement (intentionally including some duplicate pieces of data).
         ■ Reasoning for this is quite complicated
       ○ Each decision tree also works with only a random subset of m < M features (in the most common formulation, a fresh subset is drawn at each split rather than once per tree).
         ■ This is known as Feature Randomness. It forces more variation amongst the trees, ultimately lowering correlation across trees and producing more diversification.
       ○ A given piece of data is then classified by all of the decision trees, and whichever class is predicted most often is the prediction of the Random Forest model.
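
The three ingredients above (sampling with replacement, feature randomness, majority vote) can be sketched in a few dozen lines. This is an illustration under simplifying assumptions, not MLlib's implementation: each "tree" is a one-level stump, and the data set and all helper names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples with replacement, so some repeat and some are left out."""
    indices = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in indices], [labels[i] for i in indices]

def train_stump(rows, labels, features):
    """A one-level 'tree': the best midpoint threshold over the allowed features."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # the sample has a single value on every allowed feature
        majority = Counter(labels).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]

def train_forest(rows, labels, n_trees, m_features, seed=0):
    """Build n_trees stumps, each on its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        features = rng.sample(range(len(rows[0])), m_features)  # m < M features
        forest.append(train_stump(sample_rows, sample_labels, features))
    return forest

def predict(forest, row):
    """Every tree votes; the most common class wins."""
    votes = [left if row[f] <= threshold else right for f, threshold, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical data set: N = 10 rows, M = 3 features, two classes.
rows = [[1, 2, 3], [2, 3, 1], [3, 1, 2], [2, 2, 2], [1, 3, 3],
        [8, 9, 10], [9, 10, 8], [10, 8, 9], [9, 9, 9], [8, 10, 10]]
labels = ["a"] * 5 + ["b"] * 5
forest = train_forest(rows, labels, n_trees=25, m_features=2)
```

Because a stump has exactly one split, choosing the feature subset per tree and per split coincide here; with deeper trees the two variants differ.
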
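  7. ML - Random Forest cont.
     ● The following diagram visually shows the process a single piece of data goes through in order to be classified by a Random Forest classifier:
       [Diagram: a single input passed through each decision tree, with the trees' votes combined into one prediction]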
  8. ML - Naive Bayes classifier
     ● Naive Bayes classifiers are a family of simple probability-based classification algorithms based on Bayes’ Theorem:
       [Equation image: Bayes’ theorem]
     ● In a classification problem, however, we are trying to obtain the probability that some object is a member of a particular class given a large number of known values. If the input data can take many possible values, then applying Bayes’ theorem directly is not very practical when the results are based on probability tables.
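
For reference, Bayes' theorem in its standard form for a class C_k and a feature vector x = (x_1, ..., x_n). The symbols are assumed to match the x and C_k used on the next slide.

```latex
p(C_k \mid \mathbf{x}) \;=\; \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}
```

The practical difficulty lies in the term p(x | C_k): with n binary features, a full probability table for it needs on the order of 2^n entries per class, which is why the direct approach stops being practical as the number of features grows.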
  9. ML - Naive Bayes classifier cont.
     ● To resolve this issue, we make the “naive” assumption of conditional independence: assume that all features in x are mutually independent, conditional on the category Ck:
       [Equation image: conditional independence assumption]
     ● After going through some math, we come to the following conclusion given this assumption:
       [Equation image: resulting Naive Bayes classification rule]
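
Under the independence assumption, the usual result is that p(C_k | x_1, ..., x_n) is proportional to p(C_k) multiplied by the product of p(x_i | C_k) over all features, and the classifier picks the class with the largest value. The following is a minimal sketch of that rule for categorical features. The data, the helper names, and the add-one (Laplace) smoothing are assumptions for illustration; smoothing in particular is not mentioned on the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class, per-feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    vocab = defaultdict(set)             # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def predict(model, row):
    """Return the class maximizing log p(C_k) + sum_i log p(x_i | C_k)."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Add-one smoothing (an assumption, see above) keeps an unseen
            # value from zeroing out the whole product.
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(vocab[i])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: (weather, temperature) -> activity.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "cool"),
        ("rainy", "mild"), ("sunny", "hot"), ("rainy", "cool")]
labels = ["out", "out", "in", "in", "out", "in"]
model = train_naive_bayes(rows, labels)
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow when there are many features; it does not change which class wins.
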
  10. Cassandra
     ● Apache Cassandra is an open-source distributed NoSQL database designed to handle large volumes of data across multiple servers.
     ● Cassandra clusters can be upgraded either by improving hardware on current nodes (vertical scalability) or by adding more nodes (horizontal scalability).
       ○ Horizontal scalability is part of why Cassandra is so powerful - cheap machines can be added to a cluster to improve its performance significantly.
     ● Note: The demo runs code from DataStax, which uses the DSE (DataStax Enterprise) version of Cassandra (not open source).
       ○ Spark is also run together with Cassandra in DSE
  11. Spark
     ● Apache Spark:
       ○ Open-source unified analytics engine for large-scale data processing
       ○ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
     ● Spark Machine Learning Library (MLlib)
       ○ Spark’s machine learning library
       ○ Features many common ML algorithms, tools for modifying data and pipelines, loading/saving algorithms, various utilities and more.
       ○ The primary API used to be RDD (Resilient Distributed Datasets) based, but has been DataFrame based since Spark 2.0
  12. Jupyter
     ● Jupyter is an open-source web application for creating and sharing documents containing code, equations, markdown text and more.
       ○ Website for Project Jupyter: https://jupyter.org/
     ● Very popular amongst data scientists
     ● The GitHub repo demoed in this presentation contains a few different Jupyter notebooks, each with sections of Python code along with explanations/narration of the general idea of the code
     ● Jupyter notebooks can be used with programming languages other than Python, including R, Julia, Scala and many others.
  13. Demo Project Slide
     ● Link to GitHub repo: https://github.com/HadesArchitect/CaSpark
     ● Contains a Docker Compose file which will run three Docker images:
       ○ DSE - DataStax Enterprise version of Cassandra with Spark
       ○ Jupyter/pyspark-notebook: https://hub.docker.com/r/jupyter/pyspark-notebook
       ○ DataStax Studio
     ● The GitHub repo contains a lot of example files, but we will primarily be looking at two examples whose theory we discussed in this presentation:
       ○ Random Forest classification (Random Forest.ipynb)
       ○ Naive Bayes classification (Naivebayes.ipynb)
     ● Note: Before running the docker-compose command, change the line in docker-compose.yml that sets PYSPARK_SUBMIT_ARGS
       ○ A more up-to-date version of the DataStax Spark Cassandra Connector is needed
       ○ PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=dse pyspark-shell'
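
The value above is made of three parts: the connector's Maven coordinates, the Cassandra host setting, and the trailing `pyspark-shell`. As a sketch, the same string can be assembled and exported from Python, which may be handy for experimenting inside a notebook. `build_submit_args` is a hypothetical helper, not part of the demo repo, and PySpark only reads this variable when it launches its JVM, so it has to be set before the first SparkSession or SparkContext is created in that kernel.

```python
import os

def build_submit_args(connector_version="3.0.1", scala_version="2.12", host="dse"):
    """Assemble the PYSPARK_SUBMIT_ARGS value quoted on the slide (hypothetical helper)."""
    package = (
        "com.datastax.spark:"
        f"spark-cassandra-connector_{scala_version}:{connector_version}"
    )
    return (
        f"--packages {package} "
        f"--conf spark.cassandra.connection.host={host} "
        "pyspark-shell"
    )

# Must run before any SparkSession/SparkContext exists in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args()
```

The Scala suffix in the artifact name (`_2.12`) has to match the Scala version of the Spark build in the notebook image, so treat the defaults here as the slide's values rather than universally correct ones.
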
  14. Resources
     ● https://www.analyticsvidhya.com/blog/2020/12/lets-open-the-black-box-of-random-forests/
     ● https://blog.anant.us/apache-cassandra-lunch-50-machine-learning-with-spark-cassandra/
     ● https://towardsdatascience.com/understanding-random-forest-58381e0602d2
     ● https://stats.stackexchange.com/questions/36165/does-the-optimal-number-of-trees-in-a-random-forest-depend-on-the-number-of-pred/36183
     ● https://pandio.com/blog/five-ways-apache-cassandra-is-designed-to-support-machine-learning-use-cases/
     ● https://www.quora.com/Why-does-random-forest-use-sampling-with-replacement-instead-of-without-replacement
     ● https://en.wikipedia.org/wiki/Naive_Bayes_classifier
     ● https://www.geeksforgeeks.org/naive-bayes-classifiers/
     ● http://users.sussex.ac.uk/~christ/crs/ml/lec02b.html
     ● https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
  15. Strategy: Scalable Fast Data
      Architecture: Cassandra, Spark, Kafka
      Engineering: Node, Python, JVM, CLR
      Operations: Cloud, Container
      Rescue: Downtime!! I need help.
      www.anant.us | solutions@anant.us | (855) 262-6826
      3 Washington Circle, NW | Suite 301 | Washington, DC 20037