SlideShare uma empresa Scribd logo
1 de 22
Is Spark the right choice for Data
Analysis ?
Ahmed Kamal, Big Data Engineer
http://ahmedkamal.me
Resources ?
●“Advanced Analytics using Spark”, a practical book !
●“The thing I like most about this book is its focus on
examples, which are all drawn from real applications on real-
world data sets.” - Matei Zaharia, CTO at Databricks.
●It is all about developing data applications using Spark
Data Applications, like what ?
●Build a model to detect credit card fraud using thousands of
features and billions of transactions.
●Intelligently recommend millions of products to millions of
users.
●Estimate financial risk through simulations of portfolios
including millions of instruments.
●Easily manipulate data from thousands of human genomes
to detect genetic associations with disease.
Doing something useful with data
●Often, “doing something useful” = Placing a schema over it
and using SQL to answer questions like
●“of the gazillion users who made it to the third page in
our registration process, how many are over 25?”
●The field of how to structure a data warehouse and
organize information to make answering these kinds of
questions easy is a rich one.
A new superpower !
●When people say that we live in an age of “big data,” they
mean that we have tools for collecting, storing, and
processing information at a scale previously unheard of.
●There is a gap between having access to these tools and all
this data, and doing something useful with it.
Doing extra useful things
●Requirements :
a- Flexible programming model
b- Rich functionality in machine learning and statistics
●Existing Tools :
R, Python (PyData stack) and Octave
Pros : Little effort, easy to use
Cons : Viable only to small data sets, too complex to
redesign to be suitable for working over clusters of
computers.
Why is it difficult ?
●Some algorithms (like machine learning algos) would have
wide data dependencies.
• Data are partitioned across nodes.
• Network transfer is much sloooower than memory
accesses.
●What about the probability of failures ?
●Summary : We need a programming paradigm that is
sensitive to the c/c of the underlying system and that
encourages good choices and make it easy to write parallel
code.
High performance Computing
●Use Case : processing a large file full of DNA sequencing
reads in parallel
●1- Manually split the file into smaller files
●2- Submitting a job for each file split to the scheduler
●3- Continuous jobs monitoring to resubmit any failed jobs
●All to all operations like sorting the full data would require
streaming through one node or to go and use MPI.
●Relatively low level of abstraction and difficulty of use in
addition to the high cost.
The 3 truths about data science
●Successful data preprocessing is a must for successful
analysis.
–Large data sets requires special treatment
–Feature engineering should be given more time than the
time spent on the algorithms stuff. (A model for fraud
detection can use IP location info, login times, click logs)
–How would you convert features into vectors suitable for ML
algorithms.
The 3 truths about data science
●Iteration is the key.
–Famous optimization techniques like Gradient Descent
requires repeated scans over the input until convergence
–You can't get it right from the first time.
(Features/Algo/Test)
Analytics between lab and factory
A framework that makes modeling easy
but is also a good fit for production systems is a huge
win.
Apache Spark In Points
●Spark continues from what Hadoop Shines at (Linear
Scalability , Fault Tolerance)
●Spark supports DAG (Direct Acyclic Graph of operators)
●Complements its capabilities with rich set of
transformations.
●In-memory processing. (Suitable for iterations)
Apache Spark In Points
●The most important bottleneck that Spark addresses is
analyst productivity. (R, HDFS, MR, .. etc)
●Spark is better at being an operational system than most
exploratory systems and better for data exploration than
the technologies commonly used in operational systems.
●Standing on top of JVM – Good integration with Hadoop
ecosystem
Spark From the other side !
●Still young compared to MapReduce
●Its main components needs a lot of work to be mature
enough (stream processing, SQL, machine learning, and
graph processing)
–MLlib’s pipelines and transformer API model is in progress
–Its statistics and modeling functionality comes nowhere near that of
single machine languages like R
–Its SQL functionality is rich, but still lags far behind that of Hive.
Spark Programming Model
●It starts with a dataset or a few residing in a distributed
persistent storage (like HDFS)
●Writing a Spark program typically consists of a few related
steps:
–Defining a set of transformations on input data sets.
–Invoking actions that output the transformed data sets to persistent
storage or return results to the driver’s local memory.
–Running local computations that operate on the results computed in a
distributed fashion. These can help you decide what transformations
and actions to undertake next.
Why should you consider Scala ?
●Spark has already different wrappers (Java, python)
●It reduces performance overhead. (Running your different
language of top of JVM)
●It gives you access to the latest and greatest.
●It will help you understand the Spark philosophy.
–If you know how to use Spark in Scala, even if you primarily
use it from other languages, you’ll have a better
understanding of the system and will be in a better position
to “think in Spark.”
If you are immune to boredom,
there is literally nothing you cannot
accomplish.
—David Foster Wallace
Data Science's First Step
●Data cleansing is the first step in any data science project.
●Many clever analyses have been undone because the data
analyzed had fundamental quality problems or bias problem.
●It is a dull work that you have to do before you can get to
the really cool machine learning algorithm that you’ve been
dying to apply to a new problem.
Our First Real Problem !
●Name : Record Linkage
●Description :
–we have a large collection of records from one or more
source systems
–it is likely that some of the records refer to the same
underlying entity, such as a customer, a patient.
–Each of the entities has a number of attributes, such as a
name or address
The Challenge
●Challenge :
–The values of these attributes aren’t perfect
–Values might have different formatting, or typos, or missing
information.
–It is easy for a human to understand and identify at a
glance, but is difficult for a computer to learn.
Steps we are going to take
●Bringing Data from the Cluster to the Client
●Shipping Code from the Client to the Cluster
●Structuring Data with Tuples and Case Classes
●Getting some numbers regarding our data.
The End
Thanks A lot :)

Mais conteúdo relacionado

Mais procurados

Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learningStanley Wang
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...MLconf
 
Cheat sheets for data scientists
Cheat sheets for data scientistsCheat sheets for data scientists
Cheat sheets for data scientistsAjay Ohri
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overviewColleen Farrelly
 
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf dataIEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf dataIEEEFINALYEARSTUDENTPROJECTS
 
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Simplilearn
 
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Sri Ambati
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisYuanyuan Tian
 
Data Science Lifecycle
Data Science LifecycleData Science Lifecycle
Data Science LifecycleSwapnilDahake2
 
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Formulatedby
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Big Data Spain
 
Different Career Paths in Data Science
Different Career Paths in Data ScienceDifferent Career Paths in Data Science
Different Career Paths in Data ScienceRoger Huang
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics Farheen Nilofer
 
Maoye resume 2017_1_v10_short
Maoye resume 2017_1_v10_shortMaoye resume 2017_1_v10_short
Maoye resume 2017_1_v10_shortMao Ye
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingPaco Nathan
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Simplilearn
 
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Nisha Talagala
 

Mais procurados (20)

Evolution of big data
Evolution of big dataEvolution of big data
Evolution of big data
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
 
Cheat sheets for data scientists
Cheat sheets for data scientistsCheat sheets for data scientists
Cheat sheets for data scientists
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
 
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf dataIEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
 
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
 
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
Helping data scientists escape the seduction of the sandbox - Krish Swamy, We...
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
 
Data Science Lifecycle
Data Science LifecycleData Science Lifecycle
Data Science Lifecycle
 
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
 
Different Career Paths in Data Science
Different Career Paths in Data ScienceDifferent Career Paths in Data Science
Different Career Paths in Data Science
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics
 
Maoye resume 2017_1_v10_short
Maoye resume 2017_1_v10_shortMaoye resume 2017_1_v10_short
Maoye resume 2017_1_v10_short
 
Data science
Data scienceData science
Data science
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
 
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017
 

Destaque

2013_07 - Saturn Monthly Review
2013_07 - Saturn Monthly Review2013_07 - Saturn Monthly Review
2013_07 - Saturn Monthly ReviewAwangku Kasman
 
Entorno De Las Aplicaciones Web
Entorno De Las Aplicaciones Web Entorno De Las Aplicaciones Web
Entorno De Las Aplicaciones Web Mireya Marquez
 
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuhApakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuhhenry jaya teddy
 
แบบฟอร์มศาลาบริการ-ศาลาสีม่วง
แบบฟอร์มศาลาบริการ-ศาลาสีม่วงแบบฟอร์มศาลาบริการ-ศาลาสีม่วง
แบบฟอร์มศาลาบริการ-ศาลาสีม่วงYoYoo Showbiz
 
Тренинг по новым медиа
Тренинг по новым медиаТренинг по новым медиа
Тренинг по новым медиаSergey
 

Destaque (11)

Dia 25 abril
Dia 25 abrilDia 25 abril
Dia 25 abril
 
CONMEMORACÍON DE TODOS LOS DIFUNTOS
CONMEMORACÍON DE TODOS LOS DIFUNTOSCONMEMORACÍON DE TODOS LOS DIFUNTOS
CONMEMORACÍON DE TODOS LOS DIFUNTOS
 
2013_07 - Saturn Monthly Review
2013_07 - Saturn Monthly Review2013_07 - Saturn Monthly Review
2013_07 - Saturn Monthly Review
 
Entorno De Las Aplicaciones Web
Entorno De Las Aplicaciones Web Entorno De Las Aplicaciones Web
Entorno De Las Aplicaciones Web
 
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuhApakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
Apakah saudara sudah tahu berapabanyakuangygberedardiduniaburuh
 
แบบฟอร์มศาลาบริการ-ศาลาสีม่วง
แบบฟอร์มศาลาบริการ-ศาลาสีม่วงแบบฟอร์มศาลาบริการ-ศาลาสีม่วง
แบบฟอร์มศาลาบริการ-ศาลาสีม่วง
 
Techknow
TechknowTechknow
Techknow
 
ateeq
ateeqateeq
ateeq
 
Тренинг по новым медиа
Тренинг по новым медиаТренинг по новым медиа
Тренинг по новым медиа
 
AFZAAL AHMAD
AFZAAL AHMADAFZAAL AHMAD
AFZAAL AHMAD
 
3 l'usurpateur d'irsmun
3   l'usurpateur d'irsmun3   l'usurpateur d'irsmun
3 l'usurpateur d'irsmun
 

Semelhante a Is Spark the right choice for data analysis ?

Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark ZaranTech LLC
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataTrieu Nguyen
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019VMware Tanzu
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPaige_Roberts
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine LearningSudarsun Santhiappan
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworksIJDKP
 

Semelhante a Is Spark the right choice for data analysis ? (20)

Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Spark
SparkSpark
Spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
Tutorial4
Tutorial4Tutorial4
Tutorial4
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
INFO491FinalPaper
INFO491FinalPaperINFO491FinalPaper
INFO491FinalPaper
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 

Último

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 

Último (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 

Is Spark the right choice for data analysis ?

  • 1. Is Spark the right choice for Data Analysis ? Ahmed Kamal, Big Data Engineer http://ahmedkamal.me
  • 2. Resources ? ●“Advanced Analytics using Spark”, a practical book ! ●“The thing I like most about this book is its focus on examples, which are all drawn from real applications on real- world data sets.” - Matei Zaharia, CTO at Databricks. ●It is all about developing data applications using Spark
  • 3. Data Applications, like what ? ●Build a model to detect credit card fraud using thousands of features and billions of transactions. ●Intelligently recommend millions of products to millions of users. ●Estimate financial risk through simulations of portfolios including millions of instruments. ●Easily manipulate data from thousands of human genomes to detect genetic associations with disease.
  • 4. Doing something useful with data ●Often, “doing something useful” = Placing a schema over it and using SQL to answer questions like ●“of the gazillion users who made it to the third page in our registration process, how many are over 25?” ●The field of how to structure a data warehouse and organize information to make answering these kinds of questions easy is a rich one.
  • 5. A new superpower ! ●When people say that we live in an age of “big data,” they mean that we have tools for collecting, storing, and processing information at a scale previously unheard of. ●There is a gap between having access to these tools and all this data, and doing something useful with it.
  • 6. Doing extra useful things ●Requirements : a- Flexible programming model b- Rich functionality in machine learning and statistics ●Existing Tools : R, Python (PyData stack) and Octave Pros : Little effort, easy to use Cons : Viable only to small data sets, too complex to redesign to be suitable for working over clusters of computers.
  • 7. Why is it difficult ? ●Some algorithms (like machine learning algos) would have wide data dependencies. • Data are partitioned across nodes. • Network transfer is much sloooower than memory accesses. ●What about the probability of failures ? ●Summary : We need a programming paradigm that is sensitive to the c/c of the underlying system and that encourages good choices and make it easy to write parallel code.
  • 8. High performance Computing ●Use Case : processing a large file full of DNA sequencing reads in parallel ●1- Manually split the file into smaller files ●2- Submitting a job for each file split to the scheduler ●3- Continuous jobs monitoring to resubmit any failed jobs ●All to all operations like sorting the full data would require streaming through one node or to go and use MPI. ●Relatively low level of abstraction and difficulty of use in addition to the high cost.
  • 9. The 3 truths about data science ●Successful data preprocessing is a must for successful analysis. –Large data sets requires special treatment –Feature engineering should be given more time than the time spent on the algorithms stuff. (A model for fraud detection can use IP location info, login times, click logs) –How would you convert features into vectors suitable for ML algorithms.
  • 10. The 3 truths about data science ●Iteration is the key. –Famous optimization techniques like Gradient Descent requires repeated scans over the input until convergence –You can't get it right from the first time. (Features/Algo/Test)
  • 11. Analytics between lab and factory A framework that makes modeling easy but is also a good fit for production systems is a huge win.
  • 12. Apache Spark In Points ●Spark continues from what Hadoop Shines at (Linear Scalability , Fault Tolerance) ●Spark supports DAG (Direct Acyclic Graph of operators) ●Complements its capabilities with rich set of transformations. ●In-memory processing. (Suitable for iterations)
  • 13. Apache Spark In Points ●The most important bottleneck that Spark addresses is analyst productivity. (R, HDFS, MR, .. etc) ●Spark is better at being an operational system than most exploratory systems and better for data exploration than the technologies commonly used in operational systems. ●Standing on top of JVM – Good integration with Hadoop ecosystem
  • 14. Spark From the other side ! ●Still young compared to MapReduce ●Its main components needs a lot of work to be mature enough (stream processing, SQL, machine learning, and graph processing) –MLlib’s pipelines and transformer API model is in progress –Its statistics and modeling functionality comes nowhere near that of single machine languages like R –Its SQL functionality is rich, but still lags far behind that of Hive.
  • 15. Spark Programming Model ●It starts with a dataset or a few residing in a distributed persistent storage (like HDFS) ●Writing a Spark program typically consists of a few related steps: –Defining a set of transformations on input data sets. –Invoking actions that output the transformed data sets to persistent storage or return results to the driver’s local memory. –Running local computations that operate on the results computed in a distributed fashion. These can help you decide what transformations and actions to undertake next.
  • 16. Why should you consider Scala ? ●Spark has already different wrappers (Java, python) ●It reduces performance overhead. (Running your different language of top of JVM) ●It gives you access to the latest and greatest. ●It will help you understand the Spark philosophy. –If you know how to use Spark in Scala, even if you primarily use it from other languages, you’ll have a better understanding of the system and will be in a better position to “think in Spark.”
  • 17. If you are immune to boredom, there is literally nothing you cannot accomplish. —David Foster Wallace
  • 18. Data Science's First Step ●Data cleansing is the first step in any data science project. ●Many clever analyses have been undone because the data analyzed had fundamental quality problems or bias problem. ●It is a dull work that you have to do before you can get to the really cool machine learning algorithm that you’ve been dying to apply to a new problem.
  • 19. Our First Real Problem ! ●Name : Record Linkage ●Description : –we have a large collection of records from one or more source systems –it is likely that some of the records refer to the same underlying entity, such as a customer, a patient. –Each of the entities has a number of attributes, such as a name or address
  • 20. The Challenge ●Challenge : –The values of these attributes aren’t perfect –Values might have different formatting, or typos, or missing information. –It is easy for a human to understand and identify at a glance, but is difficult for a computer to learn.
  • 21. Steps we are going to take ●Bringing Data from the Cluster to the Client ●Shipping Code from the Client to the Cluster ●Structuring Data with Tuples and Case Classes ●Getting some numbers regarding our data.