SlideShare uma empresa Scribd logo
1 de 15
Version 1.0
Apache Cassandra Lunch #54: Machine
Learning with Spark + Cassandra Part 2
An Anant Corporation Story.
What is Machine Learning?
● Machine learning is “is the study of computer algorithms that improve automatically through
experience and by the use of data”
○ Seen as a part of artificial intelligence.
○ Used in a large number of applications in today’s society in many different fields.
■ I.E. Medicine (drug discovery), image recognition, email filtering, ...
● Machine learning uses complicated mathematical concepts and systems (heavily focused on linear
algebra), making it appear intimidating to many beginners
○ However, many open-source and easy to access tools exist for working with and using machine learning in
the real world. Some examples include:
■ TensorFlow: https://www.tensorflow.org/
■ Spark Machine Learning Library (MLlib): http://spark.apache.org/docs/latest/ml-guide.html
○ Additionally, many well constructed resources exist for learning machine learning online
■ Basics ML courses
■ More advanced, theory oriented look at ML i.e. Caltech CS 156 lectures:
https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
Machine Learning: 4 stages of processing data
● Generally, there are four stages to a machine
learning task:
a. Preparation of data
■ Converting Raw data into data that
may be labeled and have
features/components separated
b. Splitting of data
■ A portion of your data needs to be
used for training your model, a part
for validating that the training is
“possibly okay” and then a portion for
testing your model on
c. Training your model
d. Obtaining predictions using the trained
model
Machine Learning - why Cassandra?
● Speed/Reliability:
○ Machine learning projects tend to involve processing very large amounts of data in order to properly train
complicated models (millions of points of data are necessary for high accuracy in some applications)
■ Cassandra is very good with dealing with large amounts of incoming data.
○ Additionally, Cassandra is built upon with a masterless architecture - there are no “master” nodes, and thus
there is no single point of failure.
■ Data in Cassandra is replicated and multiple copies are stored across >1 nodes, ensuring the
availability of data even when nodes drop from the cluster or unexpectedly shut down.
● Scalability/Upgradability:
○ Cassandra can be scaled both horizontally and vertically:
■ Horizontal scalability allows the addition of many weaker machines as opposed to upgrading to more
powerful hardware on a few machines, thus it is easy to scale up your performance with relatively
little cost
ML - Random Forest
● The building block of a Random Forest is a concept known as a Decision Tree:
○ Decisions Tree Learning is a predictive modelling approach used in statistics, data mining and machine
learning.
○ Given a set of data and labels to the data, we can build a decision tree to try and figure out how to separate
the data into classes while testing on various attributes of the data.
■ Each node represents an attribute we test on
ML - Random Forest cont.
● A random forest, as the name suggests, consists of a large group of decisions trees made with
different segments of the data.
● Suppose we have N pieces of data with M features. We arbitrarily select to make n decision trees
as part of the random forest.
○ For each initial input used to make the decision trees, we take a random sample of size N from the data but
with replacement (intentionally adding in some duplicate pieces of data)
■ Reasoning for this is quite complicated
○ Each decision tree also arbitrarily selects only m < M features of the data to make the decision tree based on.
■ This is known as Feature Randomness. It forces more variation amongst the trees, thus ultimately
lowering correlation across trees and producing more diversification.
○ A given piece of data is then run through/classified by all of the decision trees and whichever class is
predicted the most is the result of the prediction for the Random Forest model.
ML - Random Forest cont.
● The following diagram visually shows the process a single piece of data goes through in order to be
classified by a Random Forest classifier:
ML - Naive Bayes classifier
● The Naive Bayes classifier is a collection of simple probability based classification algorithms
based on Bayes’ Theorem:
● In a classification problem however, we are trying to obtain the probability that some object/thing
is a member of a particular class given some large number of known values. If there is plenty of
variation/possible values for the input data, then using Bayes’ theorem in complicated problems is
not very practical if we are basing our results on probability tables.
ML - Naive Bayes classifier cont...
● To resolve this issue, we make the “naive” assumptions of conditional independence: assume that
all features in x are mutually independent, conditional on the category Ck:
● After going through some math, we come to the following conclusion given this set of assumptions:
Cassandra
● Apache Cassandra is an open-source distributed No-
SQL database designed to handle large volumes of data
across multiple different servers
● Cassandra clusters can be upgraded by either
improving hardware on current nodes (vertical
scalability) or adding more nodes (horizontal
scalability)
○ Horizontal scalability is part of why Cassandra is so
powerful - cheap machines can be added to a cluster to
improve its performance in a significant manner
● Note: Demo runs code from DataStax, and they use the
DSE (DataStax Enterprise) version of Cassandra (not
open source)
○ Spark is also run together with Cassandra in DSE
Spark
● Apache Spark:
○ Open-Source unified analytics engine for large-scale
data processing
○ Provides an interface for programmingentire clusters
with implicit data parallelism and fault tolerance
● Spark Machine Learning Library (MLlib)
○ Spark’s machine learning library
○ Features many common ML algorithms, tools for
modifying data and pipelines, loading/saving
algorithms, various utilities and more.
○ Primary API used to be RDD (Resilient Distributed
Datasets) based, but has switched over to being
DataFrame based since Spark 2.0
Jupyter
● Jupyter is an open-source web application for creating and
sharing documents containing code, equations, markdown
text and more.
○ Website for Project Jupyter: https://jupyter.org/
● Very popular amongst data scientists
● Github repo which will be demoed in this presentation
contains a few different Jupyter notebooks containing
sections of Python code along with explanations/narration
of the general idea of the code
● Jupyter notebooks can be used with programming languages
other than python, including R, Julia, Scala and many others.
Demo Project Slide
● Link to Github Repo: https://github.com/HadesArchitect/CaSpark
● Contains a Docker compose file which will run three docker images:
○ DSE - DataStax Enterprise version of Cassandra with Spark
○ Jupyter/pyspark-notebook: https://hub.docker.com/r/jupyter/pyspark-notebook
○ DataStax Studio
● Github Repo contains a lot of example files, but we will be primarily looking at two examples
whose theory we discussed in this presentation:
○ Random Forest classification (Random Forest.ipynb)
○ Naive Bayes classification (Naivebayes.ipynb)
● Note: Need to make a change to the docker-compose.yml file before running docker-compose
command, changing the line dealing with PYSPARK_SUBMIT_ARGS
○ Need to use a more up to date version of the DataStax Spark Cassandra Connector
○ PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf
spark.cassandra.connection.host=dse pyspark-shell'
Resources
● https://www.analyticsvidhya.com/blog/2020/12/lets-open-the-black-box-of-random-forests/
● https://blog.anant.us/apache-cassandra-lunch-50-machine-learning-with-spark-cassandra/
● https://towardsdatascience.com/understanding-random-forest-58381e0602d2
● https://stats.stackexchange.com/questions/36165/does-the-optimal-number-of-trees-in-a-
random-forest-depend-on-the-number-of-pred/36183
● https://pandio.com/blog/five-ways-apache-cassandra-is-designed-to-support-machine-
learning-use-cases/
● https://www.quora.com/Why-does-random-forest-use-sampling-with-replacement-instead-
of-without-replacement
● https://en.wikipedia.org/wiki/Naive_Bayes_classifier
● https://www.geeksforgeeks.org/naive-bayes-classifiers/
● http://users.sussex.ac.uk/~christ/crs/ml/lec02b.html
● https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM,CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

Mais conteúdo relacionado

Mais procurados

"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J..."Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...Dataconomy Media
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisTrieu Nguyen
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataDhanashri Yadav
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架hdhappy001
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Monitoring open suse_asia_summit_2016
Monitoring open suse_asia_summit_2016Monitoring open suse_asia_summit_2016
Monitoring open suse_asia_summit_2016Didiet A. Pambudiono
 
Traffic Matrices and its measurement
Traffic Matrices and its measurementTraffic Matrices and its measurement
Traffic Matrices and its measurementeetacupc
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesSigmoid
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixStitch Fix Algorithms
 

Mais procurados (20)

"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J..."Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data AnalysisApache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Apache cassandra
Apache cassandraApache cassandra
Apache cassandra
 
夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)
 
Monitoring open suse_asia_summit_2016
Monitoring open suse_asia_summit_2016Monitoring open suse_asia_summit_2016
Monitoring open suse_asia_summit_2016
 
CSB_community
CSB_communityCSB_community
CSB_community
 
Traffic Matrices and its measurement
Traffic Matrices and its measurementTraffic Matrices and its measurement
Traffic Matrices and its measurement
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Spark
SparkSpark
Spark
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch Fix
 

Semelhante a Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2

Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learningAmer Ather
 
Anurag Awasthi - Machine Learning applications for CloudStack
Anurag Awasthi - Machine Learning applications for CloudStackAnurag Awasthi - Machine Learning applications for CloudStack
Anurag Awasthi - Machine Learning applications for CloudStackShapeBlue
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoSri Ambati
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusBoldRadius Solutions
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
Hadoop Training Tutorial for Freshers
Hadoop Training Tutorial for FreshersHadoop Training Tutorial for Freshers
Hadoop Training Tutorial for Freshersrajkamaltibacademy
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionEmanuele Bezzi
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Databricks
 

Semelhante a Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2 (20)

Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learning
 
Anurag Awasthi - Machine Learning applications for CloudStack
Anurag Awasthi - Machine Learning applications for CloudStackAnurag Awasthi - Machine Learning applications for CloudStack
Anurag Awasthi - Machine Learning applications for CloudStack
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Hadoop on-mesos
Hadoop on-mesosHadoop on-mesos
Hadoop on-mesos
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadius
 
Spark
SparkSpark
Spark
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Hadoop Training Tutorial for Freshers
Hadoop Training Tutorial for FreshersHadoop Training Tutorial for Freshers
Hadoop Training Tutorial for Freshers
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
 
Py tables
Py tablesPy tables
Py tables
 
PyTables
PyTablesPyTables
PyTables
 
Large Data Analyze With PyTables
Large Data Analyze With PyTablesLarge Data Analyze With PyTables
Large Data Analyze With PyTables
 

Mais de Anant Corporation

QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137Anant Corporation
 
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdfKono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdfAnant Corporation
 
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotData Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotAnant Corporation
 
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...Anant Corporation
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAnant Corporation
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapAnant Corporation
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowAnant Corporation
 
Cassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward TalksCassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward TalksAnant Corporation
 
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with ArcionData Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with ArcionAnant Corporation
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Anant Corporation
 
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & FutureCassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & FutureAnant Corporation
 
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Anant Corporation
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsApache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsAnant Corporation
 
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraApache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraAnant Corporation
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessAnant Corporation
 

Mais de Anant Corporation (20)

QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
 
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdfKono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
 
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotData Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
 
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
 
YugabyteDB Developer Tools
YugabyteDB Developer ToolsYugabyteDB Developer Tools
YugabyteDB Developer Tools
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with Airflow
 
Cassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward TalksCassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward Talks
 
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with ArcionData Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
 
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & FutureCassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
 
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
CL 121
CL 121CL 121
CL 121
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsApache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
 
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraApache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
 

Último

Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 

Último (20)

Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 

Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2

  • 1. Version 1.0 Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2 An Anant Corporation Story.
  • 2. What is Machine Learning? ● Machine learning is “is the study of computer algorithms that improve automatically through experience and by the use of data” ○ Seen as a part of artificial intelligence. ○ Used in a large number of applications in today’s society in many different fields. ■ I.E. Medicine (drug discovery), image recognition, email filtering, ... ● Machine learning uses complicated mathematical concepts and systems (heavily focused on linear algebra), making it appear intimidating to many beginners ○ However, many open-source and easy to access tools exist for working with and using machine learning in the real world. Some examples include: ■ TensorFlow: https://www.tensorflow.org/ ■ Spark Machine Learning Library (MLlib): http://spark.apache.org/docs/latest/ml-guide.html ○ Additionally, many well constructed resources exist for learning machine learning online ■ Basics ML courses ■ More advanced, theory oriented look at ML i.e. Caltech CS 156 lectures: https://www.youtube.com/watch?v=mbyG85GZ0PI&list=PLD63A284B7615313A
  • 3. Machine Learning: 4 stages of processing data ● Generally, there are four stages to a machine learning task: a. Preparation of data ■ Converting Raw data into data that may be labeled and have features/components separated b. Splitting of data ■ A portion of your data needs to be used for training your model, a part for validating that the training is “possibly okay” and then a portion for testing your model on c. Training your model d. Obtaining predictions using the trained model
  • 4. Machine Learning - why Cassandra? ● Speed/Reliability: ○ Machine learning projects tend to involve processing very large amounts of data in order to properly train complicated models (millions of points of data are necessary for high accuracy in some applications) ■ Cassandra is very good with dealing with large amounts of incoming data. ○ Additionally, Cassandra is built upon with a masterless architecture - there are no “master” nodes, and thus there is no single point of failure. ■ Data in Cassandra is replicated and multiple copies are stored across >1 nodes, ensuring the availability of data even when nodes drop from the cluster or unexpectedly shut down. ● Scalability/Upgradability: ○ Cassandra can be scaled both horizontally and vertically: ■ Horizontal scalability allows the addition of many weaker machines as opposed to upgrading to more powerful hardware on a few machines, thus it is easy to scale up your performance with relatively little cost
  • 5. ML - Random Forest ● The building block of a Random Forest is a concept known as a Decision Tree: ○ Decisions Tree Learning is a predictive modelling approach used in statistics, data mining and machine learning. ○ Given a set of data and labels to the data, we can build a decision tree to try and figure out how to separate the data into classes while testing on various attributes of the data. ■ Each node represents an attribute we test on
  • 6. ML - Random Forest cont. ● A random forest, as the name suggests, consists of a large group of decisions trees made with different segments of the data. ● Suppose we have N pieces of data with M features. We arbitrarily select to make n decision trees as part of the random forest. ○ For each initial input used to make the decision trees, we take a random sample of size N from the data but with replacement (intentionally adding in some duplicate pieces of data) ■ Reasoning for this is quite complicated ○ Each decision tree also arbitrarily selects only m < M features of the data to make the decision tree based on. ■ This is known as Feature Randomness. It forces more variation amongst the trees, thus ultimately lowering correlation across trees and producing more diversification. ○ A given piece of data is then run through/classified by all of the decision trees and whichever class is predicted the most is the result of the prediction for the Random Forest model.
  • 7. ML - Random Forest cont. ● The following diagram visually shows the process a single piece of data goes through in order to be classified by a Random Forest classifier:
  • 8. ML - Naive Bayes classifier ● The Naive Bayes classifier is a collection of simple probability based classification algorithms based on Bayes’ Theorem: ● In a classification problem however, we are trying to obtain the probability that some object/thing is a member of a particular class given some large number of known values. If there is plenty of variation/possible values for the input data, then using Bayes’ theorem in complicated problems is not very practical if we are basing our results on probability tables.
  • 9. ML - Naive Bayes classifier cont... ● To resolve this issue, we make the “naive” assumptions of conditional independence: assume that all features in x are mutually independent, conditional on the category Ck: ● After going through some math, we come to the following conclusion given this set of assumptions:
  • 10. Cassandra ● Apache Cassandra is an open-source distributed No- SQL database designed to handle large volumes of data across multiple different servers ● Cassandra clusters can be upgraded by either improving hardware on current nodes (vertical scalability) or adding more nodes (horizontal scalability) ○ Horizontal scalability is part of why Cassandra is so powerful - cheap machines can be added to a cluster to improve its performance in a significant manner ● Note: Demo runs code from DataStax, and they use the DSE (DataStax Enterprise) version of Cassandra (not open source) ○ Spark is also run together with Cassandra in DSE
  • 11. Spark ● Apache Spark: ○ Open-Source unified analytics engine for large-scale data processing ○ Provides an interface for programmingentire clusters with implicit data parallelism and fault tolerance ● Spark Machine Learning Library (MLlib) ○ Spark’s machine learning library ○ Features many common ML algorithms, tools for modifying data and pipelines, loading/saving algorithms, various utilities and more. ○ Primary API used to be RDD (Resilient Distributed Datasets) based, but has switched over to being DataFrame based since Spark 2.0
  • 12. Jupyter ● Jupyter is an open-source web application for creating and sharing documents containing code, equations, markdown text and more. ○ Website for Project Jupyter: https://jupyter.org/ ● Very popular amongst data scientists ● Github repo which will be demoed in this presentation contains a few different Jupyter notebooks containing sections of Python code along with explanations/narration of the general idea of the code ● Jupyter notebooks can be used with programming languages other than python, including R, Julia, Scala and many others.
  • 13. Demo Project Slide ● Link to Github Repo: https://github.com/HadesArchitect/CaSpark ● Contains a Docker compose file which will run three docker images: ○ DSE - DataStax Enterprise version of Cassandra with Spark ○ Jupyter/pyspark-notebook: https://hub.docker.com/r/jupyter/pyspark-notebook ○ DataStax Studio ● Github Repo contains a lot of example files, but we will be primarily looking at two examples whose theory we discussed in this presentation: ○ Random Forest classification (Random Forest.ipynb) ○ Naive Bayes classification (Naivebayes.ipynb) ● Note: Need to make a change to the docker-compose.yml file before running docker-compose command, changing the line dealing with PYSPARK_SUBMIT_ARGS ○ Need to use a more up to date version of the DataStax Spark Cassandra Connector ○ PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=dse pyspark-shell'
  • 14. Resources ● https://www.analyticsvidhya.com/blog/2020/12/lets-open-the-black-box-of-random-forests/ ● https://blog.anant.us/apache-cassandra-lunch-50-machine-learning-with-spark-cassandra/ ● https://towardsdatascience.com/understanding-random-forest-58381e0602d2 ● https://stats.stackexchange.com/questions/36165/does-the-optimal-number-of-trees-in-a- random-forest-depend-on-the-number-of-pred/36183 ● https://pandio.com/blog/five-ways-apache-cassandra-is-designed-to-support-machine- learning-use-cases/ ● https://www.quora.com/Why-does-random-forest-use-sampling-with-replacement-instead- of-without-replacement ● https://en.wikipedia.org/wiki/Naive_Bayes_classifier ● https://www.geeksforgeeks.org/naive-bayes-classifiers/ ● http://users.sussex.ac.uk/~christ/crs/ml/lec02b.html ● https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
  • 15. Strategy: Scalable Fast Data Architecture: Cassandra, Spark, Kafka Engineering: Node, Python, JVM,CLR Operations: Cloud, Container Rescue: Downtime!! I need help. www.anant.us | solutions@anant.us | (855) 262-6826 3 Washington Circle, NW | Suite 301 | Washington, DC 20037