Spark Streaming As Near 
Realtime ETL 
Paris Data Geek 
18/09/2014 
Djamel Zouaoui 
@DjamelOnLine
Who am I ? 
Djamel Zouaoui 
Director Of Engineering 
@DjamelOnLine 
#Data 
#Scala 
#RecSys #Tech 
#MachineLearning 
#NoSql 
#BigData 
#Spark 
#Dev 
#R 
#Architecture
What is Spark?
Fast and expressive cluster computing engine, compatible with Apache Hadoop
• Efficient
– General execution graphs
– In-memory storage
• Usable
– Rich APIs in Java, Scala, Python
– Interactive shell
RDD in Spark
• Resilient Distributed Dataset
• Storage abstraction for datasets in Spark
• Immutable
• Fault recovery
– Each RDD remembers how it was created and can recover if any part of the data is lost
• 3 kinds of operations
– Transformations: lazy by nature; create a new dataset from an existing one
– Actions: return a value or export data after performing a computation
– Persistence: cache a dataset (on disk, in RAM, or mixed) for future operations
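The three kinds of operations can be sketched as follows. This is a minimal illustration, not code from the talk: it assumes Spark is on the classpath and runs in local mode, and the input range, the object name `RddOperationsSketch` and the function name `evenSquares` are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object RddOperationsSketch {
  // Transformations only: nothing is computed when this function returns.
  def evenSquares(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 10)   // hypothetical in-memory input
      .map(n => n * n)        // transformation (lazy)
      .filter(_ % 2 == 0)     // transformation (lazy)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-operations-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = evenSquares(sc)
      // Persistence: keep the partitions in RAM, spilling to disk if needed.
      squares.persist(StorageLevel.MEMORY_AND_DISK)
      // Actions: these trigger the actual computation.
      println(squares.count())
      println(squares.collect().mkString(", "))
    } finally sc.stop()
  }
}
```

Because the RDD is persisted before the first action, the second action can reuse the cached partitions instead of recomputing the lineage.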
sparkContext.textFile("hdfs://…")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
  .collect()
[Figure, built up over three slides: the job above drawn as the operator chain textFile → flatMap → map → reduceByKey → collect, then split into Stage 1 and Stage 2]
Ecosystem
• Spark Streaming: DStreams (streams of RDDs)
• GraphX: RDD-based graphs
• MLlib: RDD-based matrices
• Spark SQL: RDD-based tables
• Spark RDD API
• Storage: HDFS, S3, Cassandra
• Cluster managers: YARN, Mesos, Standalone
What is Spark Streaming?
Project started in early 2012; extends Spark for big data stream processing that:
• Scales to hundreds of nodes
• Achieves second-scale latencies
• Recovers efficiently from failures
• Integrates with batch and interactive processing
How does it work?
• Input source definition
– Input D-Stream
• D-Stream computations
– Window level
– Stateful option
– …
• Classic RDD manipulation
– Transformation
– Action
Code 
TOPOLOGY FREE
//StreamingContext & Input source creation 
//Standard transformations 
//Window usage 
//Start the streaming and put it in the background
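The code on this slide was an image and only its four comments survive. The sketch below follows those comments but is not the author's code: the socket source on `localhost:9999`, the 5-second batch interval, the 30-second window sliding every 10 seconds, and the names `StreamingSkeletonSketch` and `windowedWordCounts` are all assumptions made for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object StreamingSkeletonSketch {
  // Standard transformations followed by window usage.
  def windowedWordCounts(lines: DStream[String]): DStream[(String, Int)] =
    lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      // Count over the last 30 seconds, recomputed every 10 seconds.
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

  def main(args: Array[String]): Unit = {
    // StreamingContext & input source creation.
    // local[2]: a receiver occupies one core, so at least one more is needed for processing.
    val conf = new SparkConf().setAppName("streaming-skeleton-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source

    windowedWordCounts(lines).print()

    // Start the streaming: start() returns immediately and processing runs in
    // background threads; awaitTermination() keeps the driver alive.
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The whole pipeline is one chain of DStream operations, with no separate topology definition to wire up. On Spark versions before 1.3, the pair operations also need `import org.apache.spark.streaming.StreamingContext._`.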
Internals 
• Two main processes 
– Receivers in charge of the D-Stream creation 
– Workers, which are in charge of data processing
• These processes are autonomous & independent 
– No cores & resources shared 
– No information shared
Execution Model – Receiving Data
[Diagram: the Spark Streaming + Spark Driver process (Network Input Tracker, Block Manager Master) and the Spark Workers (Receiver, Block Managers). Flow labels: StreamingContext.start(), data received, blocks pushed, blocks replicated]
Execution Model – Job Scheduling
[Diagram: the Spark Streaming + Spark Driver process (Network Input Tracker, DStream Graph, Job Scheduler, Job Manager with its Job Queue, Spark's schedulers) and the workers (Receiver, Block Managers). Flow labels: block IDs, RDDs, jobs, jobs executed on worker nodes]
Use Case: Find True Love!
Build a recommender system based on implicit and explicit data to find the best match for you
• Based on machine learning models
• Processed offline (batch)
• On big (bunch of) data
• Main goals of streaming platforms:
– Store a lot of data
– Clean it
– Transform it
Overview
[Diagram: KAFKA topics feeding a Spark cluster that runs a Data Receiver, a Data Cleaning job and a Data Modeling job, with two HDFS storage blocks]
• Spark in Standalone mode
• 120 cores available on Spark
• 4.5 GB RAM per core
• Based on a Hadoop cluster for HDFS storage (10 TB)
• HDP 2.0
• 8 machines (2 masters, 6 slaves)
Job details
Data Receiver
• Use of the provided Kafka source
• Naive implementation:
– Based on autocommit
– Automatic offset management
Data Cleaning job
• Cleaning with classic RDD transformations
• Persist new RDDs
– In HDFS for other Spark jobs (batch)
– In RAM to speed up the next step
Data Modeling job
• Binary matrix
• Scoring based on current events and history
– History is loaded from RDDs stored on HDFS
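The cleaning step described above might look like the sketch below. It is an illustration only: the tab-separated event format, the `Event` case class, and the names `CleaningJobSketch`, `clean` and `persistAndStore` are assumptions, since the deck does not show the record layout or the job's code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CleaningJobSketch {
  // Hypothetical event format: "userId<TAB>targetId<TAB>action".
  final case class Event(userId: String, targetId: String, action: String)

  // Cleaning with classic RDD transformations: trim, drop malformed lines, parse.
  def clean(raw: RDD[String]): RDD[Event] =
    raw
      .map(_.trim)
      .map(_.split("\t"))
      .filter(fields => fields.length == 3 && fields.forall(_.nonEmpty))
      .map(fields => Event(fields(0), fields(1), fields(2)))

  // Keep the cleaned RDD in RAM for the next step and write it out for batch jobs.
  def persistAndStore(events: RDD[Event], outputPath: String): RDD[Event] = {
    val cached = events.persist(StorageLevel.MEMORY_ONLY)
    cached
      .map(e => Seq(e.userId, e.targetId, e.action).mkString("\t"))
      .saveAsTextFile(outputPath) // e.g. an hdfs:// path chosen by the caller
    cached
  }
}
```

In a streaming job these functions would typically be applied to each batch RDD, for example inside `foreachRDD`, with a distinct output path per batch.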
Issues
• Data loss
– In the receiver phase, due to the naive Kafka consumer
– Need a more robust client with manual offset management (vs. autocommit)
• The delights of (de)serialisation
– Kryo / Avro / Parquet…: not directly due to Spark, but not easy
– Major issues occur during import/export steps
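Why committing offsets before processing can lose data is easier to see in a toy model. The sketch below is plain Scala with an in-memory "topic"; it does not use the Kafka client API, and every name in it (`ToyConsumer`, `run`, `demo`) is hypothetical. Real autocommit is periodic rather than per record, so this is a simplified worst case.

```scala
object OffsetCommitSketch {
  // A toy in-memory "topic" with a single committed offset; not the Kafka client API.
  final class ToyConsumer(log: Vector[String]) {
    private var committed = 0
    // Returns the record at the committed offset, if any remain.
    def poll(): Option[(Int, String)] =
      if (committed < log.length) Some((committed, log(committed))) else None
    def commit(offset: Int): Unit = committed = offset + 1
  }

  // Drains the log. `process` may throw to simulate a crash, after which the
  // consumer "restarts" from the last committed offset.
  def run(log: Vector[String], commitBeforeProcessing: Boolean)(process: String => Unit): Unit = {
    val consumer = new ToyConsumer(log)
    var next = consumer.poll()
    while (next.isDefined) {
      val (offset, record) = next.get
      if (commitBeforeProcessing) consumer.commit(offset) // autocommit-like
      try {
        process(record)
        if (!commitBeforeProcessing) consumer.commit(offset) // manual commit after success
      } catch {
        case _: RuntimeException => () // crash: resume from the committed offset
      }
      next = consumer.poll()
    }
  }

  // Processes "a", "b", "c"; processing "b" fails the first time it is seen.
  def demo(commitBeforeProcessing: Boolean): List[String] = {
    val processed = scala.collection.mutable.ListBuffer.empty[String]
    var alreadyFailed = false
    run(Vector("a", "b", "c"), commitBeforeProcessing) { record =>
      if (record == "b" && !alreadyFailed) {
        alreadyFailed = true
        throw new RuntimeException("simulated crash")
      }
      processed += record
    }
    processed.toList
  }
}
```

With `demo(commitBeforeProcessing = true)` the record "b" is never processed, because its offset was already committed when the crash happened. With `demo(commitBeforeProcessing = false)` it is redelivered after the crash, at the cost of possible duplicates in a real system.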
And Beyond… @VIADEO
More than ETL, an analytics backend
[Diagram components: RabbitMQ; Spark cluster (Data Receiver, several Data Modeling jobs); ElasticSearch cluster (generic index); D3.JS webapp]
Join the Viadeo adventure 
Wanted: Software Engineers 
• We use Node.js, Spark, ElasticSearch, CQRS, AWS and many more…
• We love full-stack engineers and a flat organization
• We work in autonomous product teams
• We lunch for free ;-)
QUESTIONS ?
