SlideShare uma empresa Scribd logo
1 de 53
Baixar para ler offline
OpenSistemas
Open Source Innovation for your Business
Apache Spark y como lo usamos en nuestros proyectos
Who am I
Data Solutions Manager at OpenSistemas
Former President of Hispalinux (Spanish
LUG)
Author “La Pastilla Roja” first spanish book
about Free Software
Menú
Un poco de historia: Hadoop
Qué es Apache Spark
Arquitectura Kappa
Algunos casos reales con Spark
Un poco de historia
En principio de los tiempos
Los Papers de Google
Nacimiento del proyecto Hadoop
La era pre Spark
Google publica algunos papers claves:
●
HDFS
●
MapReduce
●
HBASE
La era pre Spark
Nace Hadoop
●
Primera implementación libre
●
Yahoo es el sponsor
●
Se desarrolla uno de los entornos de trabajo
más interesantes y productivos de los años
2000
Batch-Hadoop (MR1)
Batch-MapReduce
Batch-Cascading
Cosas que no terminan de
gustarnos de Hadoop
Algunas cosas de Hadoop no terminan de
convencernos
No todo los problemas se solucionan con
Map Reduce
Hadoop ha generado un montón de
proyectos y tecnologías para resolverlo
Mientras tanto...
Hadoop ha cumplido 10 años
Ha habido muchos cambios tanto en hardware
como en software
In memory Computing
Y de pronto nace:
Apache Spark
Es un motor de analítica paralelizado
Escrito desde cero
Heredando lo mejor de la experiencia
“hadoop”
Probablemente la obra de ingeniería más
bella de los últimos años
Qué es Apache Spark
La primera reseña importante esta pensado
desde el principio para aprovechar al máximo
la memoria
Resiliente y escalable por definición
Qué es Apache Spark
Escrito, pensado e influido por el lenguaje
Scala
Spark, the elevator pitch
Spark streaming
Spark SQL
Cuando cualquier información con un
esquema se puede consultar con SQL
A little context
July 2, 2014 Jay Kreps coined the term Kappa
Architecture in an article for O’reilly Radar
Who is Jay Kreps
Jay has been involved in lots of projects
●
Author of the essay
●
The Log: What every software engineer
should know about real-time data's unifying
abstraction (12/16/2013)
●
https://engineering.linkedin.com/distributed-systems/log-what-every-software-
engineer-should-know-about-real-time-datas-unifying
Jay Kreps
Author of the book: I ♥ Logs
Jay Kreps
Involved with projects as:
●
Apache Kafka
●
Apache Samza
●
Voldemort
●
Azkaban
●
Ex-Linkedin
●
Now co-founder and CEO of Confluent
Lambda Architecture
Look something like this:
https://www.mapr.com/developercentral/lambda-architecture
Lambda Architecture
Batch layer that provides the following
functionality
●
managing the master dataset, an
immutable, append-only set of raw data
●
pre-computing arbitrary query functions,
called batch views
https://www.mapr.com/developercentral/lambda-architecture
Lambda Architecture
Serving layer
●
This layer indexes the batch views so that they can be
queried in ad hoc with low latency
Speed layer
●
This layer accommodates all requests that are subject
to low latency requirements. Using fast and
incremental algorithms, the speed layer deals with
recent data only
Lambda Architecture
Batch layer datasets can be in a distributed filesystem, while
MapReduce can be used to create batch views that can be fed
to the serving layer
The serving layer can be implemented using NoSQL
technologies such as HBase,Apache Druid, etc
Querying can be implemented by technologies such as Apache
Drill or Impala
Speed layer can be realized with data streaming technologies
such as Apache Storm or Spark Streaming
https://www.mapr.com/developercentral/lambda-architecture
Pros of Lambda Architecture
Retain the input data unchanged
●
Think about modeling data transformations, series of
data states from the original input
Lambda architecture take in account the problem of
reprocessing data
●
This happens all the time, the code will change, and
you will need to reprocess all the information. Lots of
reasons and you will need to live with this.
Cons of Lambda Architecture
Maintain the code that need to produce the same result
from two complex distributed system is painful
●
Very different code for MapReduce and Storm/Apache Spark
Not only is about different code, is also about debugging
and interaction with other products like (hive, Oozie,
Cascading, etc)
At the end is a problem about different and diverging
programming paradigms
So what is Kappa Architecture
The proposal of Jay Kreps is so simple
●
Use kafka (or other system) that will let you retain
the full log of the data you need to reprocess
●
When you want to do the reprocessing, start a
second instance of your stream processing job that
starts processing from the beginning of the
retained data, but direct this output data to a new
output table
part II
●
When the second job has caught up, switch
the application to read from the new table
●
Stop the old version of the job, and delete
the old output table
So what is Kappa Architecture
This architecture looks something like this
So what is Kappa Architecture
The first benefit is that only you need to reprocessing
only when you change the code
You can check if the new version is working ok and if
not reverse to the old output table
You can mirror a Kafka topic to HDFS so you are not
limited to the Kafka retention configuration
You have only a code to maintain with an unique
framework
So what is Kappa Architecture
The real advantage is not about efficiency at
all (You will need extra temporarily storage
when reprocessing for example) is allowing
your team to develop, test, debug and
operate their systems on top of a single
processing framework
So what is Kappa Architecture
Is not a silver bullet to solve every problem at
Big Data
Is not a list of prescriptions of technologies.
You can implement with your favorite
frameworks
Is not a rigid set of rules. But helps to maintain
the complex projects simple
What is not Kappa Architecture
We start working with projects with a
complex structure like Linkedin looks at early
stage
That’s very usual
How we use Kappa
Architecture
How we use Kappa
Architecture
We try to refactoring the data flows to fix in
a Kappa Architecture
How we use Kappa
Architecture
How we use Kappa
Architecture
We use Kafka as Stream Data Platform
Instead of Samza we feel more comfortable
with Spark Streaming
At ASPGems we choose Apache Spark as our
Analytics Engine and not only for Spark
Streaming
How we use Kappa
Architecture
At the end, Kappa Architecture is design
pattern for us
We use/clone this pattern in almost our
projects
We have projects of every size, volume of data
or speed needing and fix with the Kappa
Architecture
How we use Kappa
Architecture
Telefónica - MSS
We use KA to calculate near real time KPIs,
SLAs related with the managed security system
We simplify the data flow of the input data
Kafka in the streaming data platform
As MPP we use CassandraDB
Use cases
IOT – OBD II
One of our clients install On Board Devices in the cars
of its customers.
We implement an API to got all the information in real
time and inject the information in Kafka
The business rules are implemented in a CEP running
into Apache Spark Streaming
As MPP we use Elastic Search
Use cases
Insurance company
We implement Kappa Architecture to process
click stream in real time and clustering users
We show content and offers that better fix
users
Use cases
Energy Facility
We implement Kappa Architecture to process and
predict energy consume.
Our customer include energy storage systems and we
got all the information about energy storage (ultra-
capacitors and batteries).
We process this information to calculate the effective
lifetime of the components and its degradation.
Use cases
Questions
Thank you
Juantomás García
juantomas@opensistemas.com
@juantomas
www.opensistemas.com
comercial@opensistemas.com

Mais conteúdo relacionado

Mais procurados

Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Spark Summit
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkDataWorks Summit
 
Neo4j Graph Streaming Services with Apache Kafka
Neo4j Graph Streaming Services with Apache KafkaNeo4j Graph Streaming Services with Apache Kafka
Neo4j Graph Streaming Services with Apache Kafkajexp
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streamingdatamantra
 
Using Visualization to Succeed with Big Data
Using Visualization to Succeed with Big Data Using Visualization to Succeed with Big Data
Using Visualization to Succeed with Big Data Pactera_US
 
An Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaAn Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaSpark Summit
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...confluent
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnDatabricks
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsAsis Mohanty
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Spark Summit
 
Zeppelin at Twitter
Zeppelin at TwitterZeppelin at Twitter
Zeppelin at TwitterPrasad Wagle
 
Operationalize Apache Spark Analytics
Operationalize Apache Spark AnalyticsOperationalize Apache Spark Analytics
Operationalize Apache Spark AnalyticsDatabricks
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...Spark Summit
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 

Mais procurados (20)

Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache Spark
 
Neo4j Graph Streaming Services with Apache Kafka
Neo4j Graph Streaming Services with Apache KafkaNeo4j Graph Streaming Services with Apache Kafka
Neo4j Graph Streaming Services with Apache Kafka
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Using Visualization to Succeed with Big Data
Using Visualization to Succeed with Big Data Using Visualization to Succeed with Big Data
Using Visualization to Succeed with Big Data
 
An Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaAn Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal Malohlava
 
Tailored for Spark
Tailored for SparkTailored for Spark
Tailored for Spark
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
Online Security Analytics on Large Scale Video Surveillance System by Yu Cao ...
 
Zeppelin at Twitter
Zeppelin at TwitterZeppelin at Twitter
Zeppelin at Twitter
 
Operationalize Apache Spark Analytics
Operationalize Apache Spark AnalyticsOperationalize Apache Spark Analytics
Operationalize Apache Spark Analytics
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...How to Boost 100x Performance for Real World Application with Apache Spark-(G...
How to Boost 100x Performance for Real World Application with Apache Spark-(G...
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 

Destaque

El software como acción humana
El software como acción humanaEl software como acción humana
El software como acción humanaOpenSistemas
 
NoSQL: Un Cambio de Paradigma - Apache Cassandra
NoSQL: Un Cambio de Paradigma - Apache CassandraNoSQL: Un Cambio de Paradigma - Apache Cassandra
NoSQL: Un Cambio de Paradigma - Apache CassandraWladimir Cabarcas
 
El futuro Data Driven en e-Learning y RR.HH.
El futuro Data Driven en e-Learning y RR.HH.El futuro Data Driven en e-Learning y RR.HH.
El futuro Data Driven en e-Learning y RR.HH.OpenSistemas
 
Tutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeTutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeSocialmetrix
 
Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016LibreCon
 
Área de Soporte - OpenSistemas
Área de Soporte - OpenSistemasÁrea de Soporte - OpenSistemas
Área de Soporte - OpenSistemasOpenSistemas
 
Drupal 7. Puesta en producción en sistemas multientorno
Drupal 7. Puesta en producción en sistemas multientornoDrupal 7. Puesta en producción en sistemas multientorno
Drupal 7. Puesta en producción en sistemas multientornoOpenSistemas
 
Área Education - OpenSistemas
Área Education - OpenSistemasÁrea Education - OpenSistemas
Área Education - OpenSistemasOpenSistemas
 
Cómo crear ports en FreeBSD #PicnicCode2015
Cómo crear ports en FreeBSD #PicnicCode2015Cómo crear ports en FreeBSD #PicnicCode2015
Cómo crear ports en FreeBSD #PicnicCode2015OpenSistemas
 
osBrain: una herramienta para la inversión automática en bolsa y mercados de ...
osBrain: una herramienta para la inversión automática en bolsa y mercados de ...osBrain: una herramienta para la inversión automática en bolsa y mercados de ...
osBrain: una herramienta para la inversión automática en bolsa y mercados de ...OpenSistemas
 
Minería de datos para trading automático
Minería de datos para trading automáticoMinería de datos para trading automático
Minería de datos para trading automáticoOpenSistemas
 

Destaque (13)

El software como acción humana
El software como acción humanaEl software como acción humana
El software como acción humana
 
NoSQL: Un Cambio de Paradigma - Apache Cassandra
NoSQL: Un Cambio de Paradigma - Apache CassandraNoSQL: Un Cambio de Paradigma - Apache Cassandra
NoSQL: Un Cambio de Paradigma - Apache Cassandra
 
El futuro Data Driven en e-Learning y RR.HH.
El futuro Data Driven en e-Learning y RR.HH.El futuro Data Driven en e-Learning y RR.HH.
El futuro Data Driven en e-Learning y RR.HH.
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Tutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeTutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtime
 
Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016Kappa Architecture, IoT of the cars - LibreCon 2016
Kappa Architecture, IoT of the cars - LibreCon 2016
 
Área de Soporte - OpenSistemas
Área de Soporte - OpenSistemasÁrea de Soporte - OpenSistemas
Área de Soporte - OpenSistemas
 
Drupal 7. Puesta en producción en sistemas multientorno
Drupal 7. Puesta en producción en sistemas multientornoDrupal 7. Puesta en producción en sistemas multientorno
Drupal 7. Puesta en producción en sistemas multientorno
 
Área Education - OpenSistemas
Área Education - OpenSistemasÁrea Education - OpenSistemas
Área Education - OpenSistemas
 
Cómo crear ports en FreeBSD #PicnicCode2015
Cómo crear ports en FreeBSD #PicnicCode2015Cómo crear ports en FreeBSD #PicnicCode2015
Cómo crear ports en FreeBSD #PicnicCode2015
 
osBrain: una herramienta para la inversión automática en bolsa y mercados de ...
osBrain: una herramienta para la inversión automática en bolsa y mercados de ...osBrain: una herramienta para la inversión automática en bolsa y mercados de ...
osBrain: una herramienta para la inversión automática en bolsa y mercados de ...
 
Minería de datos para trading automático
Minería de datos para trading automáticoMinería de datos para trading automático
Minería de datos para trading automático
 

Semelhante a Apache Spark y como lo usamos en nuestros proyectos

Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaItai Yaffe
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with SparkKnoldus Inc.
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationKnoldus Inc.
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationKnoldus Inc.
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 

Semelhante a Apache Spark y como lo usamos en nuestros proyectos (20)

Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Apache spark
Apache sparkApache spark
Apache spark
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with Spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Module01
 Module01 Module01
Module01
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 

Mais de OpenSistemas

OpenSistemas Corporate Presentation
OpenSistemas Corporate PresentationOpenSistemas Corporate Presentation
OpenSistemas Corporate PresentationOpenSistemas
 
Data Platform & Analytics OpenSistemas MSFT Playbook
Data Platform & Analytics OpenSistemas MSFT PlaybookData Platform & Analytics OpenSistemas MSFT Playbook
Data Platform & Analytics OpenSistemas MSFT PlaybookOpenSistemas
 
Proceso de liberación en el marco legal del código abierto - OpenSistemas
Proceso de liberación en el marco legal del código abierto - OpenSistemasProceso de liberación en el marco legal del código abierto - OpenSistemas
Proceso de liberación en el marco legal del código abierto - OpenSistemasOpenSistemas
 
Virtualization - Solaris LDOMs - OpenSistemas
Virtualization - Solaris LDOMs - OpenSistemasVirtualization - Solaris LDOMs - OpenSistemas
Virtualization - Solaris LDOMs - OpenSistemasOpenSistemas
 
CACert - A Community-driven Certification Authority - OpenSistemas
CACert - A Community-driven Certification Authority - OpenSistemasCACert - A Community-driven Certification Authority - OpenSistemas
CACert - A Community-driven Certification Authority - OpenSistemasOpenSistemas
 
Floss leaders - OpenSistemas
Floss leaders - OpenSistemasFloss leaders - OpenSistemas
Floss leaders - OpenSistemasOpenSistemas
 
Business Intelligence and Pentaho Services - OpenSistemas
Business Intelligence and Pentaho Services - OpenSistemasBusiness Intelligence and Pentaho Services - OpenSistemas
Business Intelligence and Pentaho Services - OpenSistemasOpenSistemas
 
easyGTD - product Info
easyGTD - product InfoeasyGTD - product Info
easyGTD - product InfoOpenSistemas
 
easyGTD - presentación producto
easyGTD - presentación productoeasyGTD - presentación producto
easyGTD - presentación productoOpenSistemas
 

Mais de OpenSistemas (10)

From SF with Love
From SF with LoveFrom SF with Love
From SF with Love
 
OpenSistemas Corporate Presentation
OpenSistemas Corporate PresentationOpenSistemas Corporate Presentation
OpenSistemas Corporate Presentation
 
Data Platform & Analytics OpenSistemas MSFT Playbook
Data Platform & Analytics OpenSistemas MSFT PlaybookData Platform & Analytics OpenSistemas MSFT Playbook
Data Platform & Analytics OpenSistemas MSFT Playbook
 
Proceso de liberación en el marco legal del código abierto - OpenSistemas
Proceso de liberación en el marco legal del código abierto - OpenSistemasProceso de liberación en el marco legal del código abierto - OpenSistemas
Proceso de liberación en el marco legal del código abierto - OpenSistemas
 
Virtualization - Solaris LDOMs - OpenSistemas
Virtualization - Solaris LDOMs - OpenSistemasVirtualization - Solaris LDOMs - OpenSistemas
Virtualization - Solaris LDOMs - OpenSistemas
 
CACert - A Community-driven Certification Authority - OpenSistemas
CACert - A Community-driven Certification Authority - OpenSistemasCACert - A Community-driven Certification Authority - OpenSistemas
CACert - A Community-driven Certification Authority - OpenSistemas
 
Floss leaders - OpenSistemas
Floss leaders - OpenSistemasFloss leaders - OpenSistemas
Floss leaders - OpenSistemas
 
Business Intelligence and Pentaho Services - OpenSistemas
Business Intelligence and Pentaho Services - OpenSistemasBusiness Intelligence and Pentaho Services - OpenSistemas
Business Intelligence and Pentaho Services - OpenSistemas
 
easyGTD - product Info
easyGTD - product InfoeasyGTD - product Info
easyGTD - product Info
 
easyGTD - presentación producto
easyGTD - presentación productoeasyGTD - presentación producto
easyGTD - presentación producto
 

Último

Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 

Último (20)

Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 

Apache Spark y como lo usamos en nuestros proyectos

  • 1. OpenSistemas Open Source Innovation for your Business Apache Spark y como lo usamos en nuestros proyectos
  • 2. Who am I Data Solutions Manager at OpenSistemas Former President of Hispalinux (Spanish LUG) Author “La Pastilla Roja” first spanish book about Free Software
  • 3. Menú Un poco de historia: Hadoop Qué es Apache Spark Arquitectura Kappa Algunos casos reales con Spark
  • 4. Un poco de historia En principio de los tiempos Los Papers de Google Nacimiento del proyecto Hadoop
  • 5. La era pre Spark Google publica algunos papers claves: ● HDFS ● MapReduce ● HBASE
  • 6. La era pre Spark Nace Hadoop ● Primera implementación libre ● Yahoo es el sponsor ● Se desarrolla uno de los entornos de trabajo más interesantes y productivos de los años 2000
  • 10. Cosas que no terminan de gustarnos de Hadoop Algunas cosas de Hadoop no terminan de convencernos No todo los problemas se solucionan con Map Reduce Hadoop ha generado un montón de proyectos y tecnologías para resolverlo
  • 11.
  • 12. Mientras tanto... Hadoop ha cumplido 10 años Ha habido muchos cambios tanto en hardware como en software
  • 14. Y de pronto nace: Apache Spark Es un motor de analítica paralelizado Escrito desde cero Heredando lo mejor de la experiencia “hadoop” Probablemente la obra de ingeniería más bella de los últimos años
  • 15. Qué es Apache Spark La primera reseña importante esta pensado desde el principio para aprovechar al máximo la memoria Resiliente y escalable por definición
  • 16. Qué es Apache Spark Escrito, pensado e influido por el lenguaje Scala
  • 17.
  • 19.
  • 20.
  • 21.
  • 22.
  • 24. Spark SQL Cuando cualquier información con un esquema se puede consultar con SQL
  • 25. A little context July 2, 2014 Jay Kreps coined the term Kappa Architecture in an article for O’reilly Radar
  • 26. Who is Jay Kreps Jay has been involved in lots of projects ● Author of the essay ● The Log: What every software engineer should know about real-time data's unifying abstraction (12/16/2013) ● https://engineering.linkedin.com/distributed-systems/log-what-every-software- engineer-should-know-about-real-time-datas-unifying
  • 27. Jay Kreps Author of the book: I ♥ Logs
  • 28. Jay Kreps Involved with projects as: ● Apache Kafka ● Apache Samza ● Voldemort ● Azkaban ● Ex-Linkedin ● Now co-founder and CEO of Confluent
  • 29. Lambda Architecture Look something like this: https://www.mapr.com/developercentral/lambda-architecture
  • 30. Lambda Architecture Batch layer that provides the following functionality ● managing the master dataset, an immutable, append-only set of raw data ● pre-computing arbitrary query functions, called batch views https://www.mapr.com/developercentral/lambda-architecture
  • 31. Lambda Architecture Serving layer ● This layer indexes the batch views so that they can be queried in ad hoc with low latency Speed layer ● This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only
  • 32. Lambda Architecture Batch layer datasets can be in a distributed filesystem, while MapReduce can be used to create batch views that can be fed to the serving layer The serving layer can be implemented using NoSQL technologies such as HBase,Apache Druid, etc Querying can be implemented by technologies such as Apache Drill or Impala Speed layer can be realized with data streaming technologies such as Apache Storm or Spark Streaming https://www.mapr.com/developercentral/lambda-architecture
  • 33. Pros of Lambda Architecture Retain the input data unchanged ● Think about modeling data transformations, series of data states from the original input Lambda architecture take in account the problem of reprocessing data ● This happens all the time, the code will change, and you will need to reprocess all the information. Lots of reasons and you will need to live with this.
  • 34. Cons of Lambda Architecture Maintain the code that need to produce the same result from two complex distributed system is painful ● Very different code for MapReduce and Storm/Apache Spark Not only is about different code, is also about debugging and interaction with other products like (hive, Oozie, Cascading, etc) At the end is a problem about different and diverging programming paradigms
  • 35. So what is Kappa Architecture The proposal of Jay Kreps is so simple ● Use kafka (or other system) that will let you retain the full log of the data you need to reprocess ● When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table
  • 36. part II ● When the second job has caught up, switch the application to read from the new table ● Stop the old version of the job, and delete the old output table So what is Kappa Architecture
  • 37. This architecture looks something like this So what is Kappa Architecture
  • 38. The first benefit is that only you need to reprocessing only when you change the code You can check if the new version is working ok and if not reverse to the old output table You can mirror a Kafka topic to HDFS so you are not limited to the Kafka retention configuration You have only a code to maintain with an unique framework So what is Kappa Architecture
  • 39. The real advantage is not about efficiency at all (You will need extra temporarily storage when reprocessing for example) is allowing your team to develop, test, debug and operate their systems on top of a single processing framework So what is Kappa Architecture
  • 40. Is not a silver bullet to solve every problem at Big Data Is not a list of prescriptions of technologies. You can implement with your favorite frameworks Is not a rigid set of rules. But helps to maintain the complex projects simple What is not Kappa Architecture
  • 41. We start working with projects with a complex structure like Linkedin looks at early stage That’s very usual How we use Kappa Architecture
  • 42. How we use Kappa Architecture
  • 43. We try to refactoring the data flows to fix in a Kappa Architecture How we use Kappa Architecture
  • 44. How we use Kappa Architecture
  • 45. We use Kafka as Stream Data Platform Instead of Samza we feel more comfortable with Spark Streaming At ASPGems we choose Apache Spark as our Analytics Engine and not only for Spark Streaming How we use Kappa Architecture
  • 46. At the end, Kappa Architecture is design pattern for us We use/clone this pattern in almost our projects We have projects of every size, volume of data or speed needing and fix with the Kappa Architecture How we use Kappa Architecture
  • 47. Telefónica - MSS We use KA to calculate near real time KPIs, SLAs related with the managed security system We simplify the data flow of the input data Kafka in the streaming data platform As MPP we use CassandraDB Use cases
  • 48. IOT – OBD II One of our clients install On Board Devices in the cars of its customers. We implement an API to got all the information in real time and inject the information in Kafka The business rules are implemented in a CEP running into Apache Spark Streaming As MPP we use Elastic Search Use cases
  • 49. Insurance company We implement Kappa Architecture to process click stream in real time and clustering users We show content and offers that better fix users Use cases
  • 50. Energy Facility We implement Kappa Architecture to process and predict energy consume. Our customer include energy storage systems and we got all the information about energy storage (ultra- capacitors and batteries). We process this information to calculate the effective lifetime of the components and its degradation. Use cases