© 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Spark on Scala – Reference Architecture
Adrian Tanase – Adobe Romania, Analytics
Agenda
§ Building data processing apps with Scala and Spark
§ Our reference architecture
§ Goals
§ Abstractions
§ Techniques
§ Tips and tricks
What is Spark?
§ General engine for large-scale data processing with APIs in Java, Scala and Python
§ Batch, Streaming, Interactive
§ Multiple Spark apps running concurrently in the same cluster
Our Requirements for Spark Apps
§ Build many data processing applications, mostly ETL and analytics
§ Batch and streaming ingestion and processing
§ Stateless and stateful aggregations
§ Consume data from Kafka, persist to HBase, HDFS and Kafka
§ Interact (in real time) with external services (S3, REST via HTTP)
§ Deployed on Mesos/Docker across AWS and Azure
Real Life With Spark
§ Generic data processing (analytics, SQL, graph, ML)
§ BUT not generic distributed computing
§ Lacks API support for things like:
  § Lifecycle events around starting / creating executors
    § e.g. instantiating a DB connection pool on a remote executor
  § Sending shared data to all executors and refreshing it at certain intervals
    § e.g. shared config that updates dynamically and stays in sync across all nodes
  § Async processing of events
    § e.g. non-blocking HTTP calls on the hot path
  § Control flow when bad things happen on remote nodes
    § e.g. pausing processing or a controlled shutdown if one node can’t reach an external service
Our Reference Architecture
§ Basic template for building Spark/Scala apps in our team
§ Take advantage of Spark’s strong points, work around its limitations
§ Decouple Spark APIs and business logic
§ Leverage strong points in Scala (blend FP and OOP)
§ Design goals – all apps should be:
  § Scalable (horizontally)
  § Reliable (at-least-once processing, no data loss)
  § Maintainable (easy to understand, change, remove code)
  § Testable (easy to write unit and integration tests)
  § Easy to configure (at deploy time)
  § Portable (to other processing frameworks like Akka or Kafka Streams)
The Sample App
§ Ingest – first component in the stack
§ Use case – basic ETL
  § load from persistent queue (Kafka)
  § unpack and validate protobuf elements
  § reach out to external config service
    § e.g. is customer active?
  § add minimal metadata (lookups to customer DB)
  § persist to data store (HBase)
  § emit for downstream processing (Kafka)
Abstractions Used
[Diagram: layers of the architecture. Main entry point; Application; Services (e.g. Validation (internal), Configuration (http)); Stateful Resources (Repository, Message Producer); Spark APIs; Config; Domain model]
Main Entrypoint
§ Load / parse configuration
§ Instantiate SparkContext, DB connections, etc.
§ Start data processing (the application) by providing concrete instances for all dependencies
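import org.apache.spark.streaming.StreamingContext

object IngestMain {
  def main(args: Array[String]): Unit = {
    // Load and parse the deploy-time configuration
    val config = IngestConfig.loadConfig
    val streamContext = new StreamingContext(...)
    // Wire concrete services and repositories into the application
    val ingestApp = getIngestApp(config)
    val ingressStream = KafkaConnectionUtils.getDStream(...)
    // Define the processing graph, then start the streaming job
    ingestApp.process(ingressStream)
    streamContext.start()
    streamContext.awaitTermination()
  }
}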
The Application
§ Assembles services / repos into the actual data processing app
§ Facilitates integration testing by not relying on actual Kafka queues, HBase connections, etc.
§ Only place in the code that "speaks" Spark (DStream, RDD, transform APIs, etc.)
§ Change this file to port the app to another streaming framework
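import org.apache.spark.streaming.dstream.DStream

trait IngestApp {
  // Dependencies are abstract members, supplied by the main entry point or by tests
  def ingestService: IngestService
  def eventRepo: ExecutorSingleton[EventRepository]

  def process(dstream: DStream[Array[Byte]]): Unit = {
    // Deserialize each Kafka payload into raw events, one partition at a time
    val rawEvents = dstream.mapPartitions { partition =>
      partition.flatMap(ingestService.toRawEvents(...))
    }
    processEvents(rawEvents)
  }
}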
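To illustrate why declaring dependencies as abstract members helps integration testing, here is a minimal, Spark-free sketch. Everything in it is a simplified stand-in and not the talk's code: `Iterator` replaces a stream partition, `EventSink` is a hypothetical output abstraction, and `TestIngestApp` is a hypothetical test wiring.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified stand-ins for the talk's types.
final case class RawEvent(body: String)

trait IngestService {
  def toRawEvents(bytes: Array[Byte]): Seq[RawEvent]
}

trait EventSink {
  def write(ev: RawEvent): Unit
}

// The app declares its dependencies as abstract members and only wires them together.
trait IngestApp {
  def ingestService: IngestService
  def sink: EventSink

  // Iterator stands in for one partition of a stream in this sketch.
  def process(partition: Iterator[Array[Byte]]): Unit =
    partition.flatMap(ingestService.toRawEvents).foreach(sink.write)
}

// Test wiring: a line-splitting service and a sink that records what it receives.
object TestIngestApp extends IngestApp {
  val received: ArrayBuffer[RawEvent] = ArrayBuffer.empty[RawEvent]

  val ingestService: IngestService = new IngestService {
    def toRawEvents(bytes: Array[Byte]): Seq[RawEvent] =
      new String(bytes, "UTF-8").split('\n').filter(_.nonEmpty).map(RawEvent(_)).toSeq
  }

  val sink: EventSink = new EventSink {
    def write(ev: RawEvent): Unit = received += ev
  }
}
```

Calling `TestIngestApp.process(Iterator("a\nb".getBytes("UTF-8")))` pushes two events into `received` without any queue or data store being involved.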
The Application (2)
§ Deals with Spark complexities so that the business services don’t have to
  § Caching, progress checkpointing, controlling side effects
  § Shipping code and stateful objects (e.g. DB connection) to executors
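def processEvents(events: DStream[RawEvent]): Unit = {
  val validEvents = events.transform { rdd =>
    // update and broadcast global config
    rdd.flatMap { event =>
      ingestService.toValidEvent(event, ...)
    }
  }
  // Cache the validated stream so later outputs can reuse it
  validEvents.cache()
  // foreachRDDOrStop is not a Spark API; presumably a project-specific extension of foreachRDD
  validEvents.foreachRDDOrStop { rdd =>
    rdd.foreachPartition { partition =>
      // Obtain the repository on the executor, once per partition
      val repo = eventRepo.get
      partition.foreach { ev => ingestService.saveEvent(ev, repo) }
    }
  }
}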
Services
§ Represent the majority of business logic
§ Stateless and generally implemented as Scala traits
  § Collection of pure functions grouped logically
  § Process immutable data structures; side effects are contained
§ All resources provided at invoke time, avoiding DI altogether
  § Avoids serialization issues of stateful resources (e.g. DB connection), concerns which are pushed to the outer application layers
§ Actual materialization of the trait can be deferred
  § e.g. object, service class, mix-in to another class
§ Allows for a very modular architecture
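A minimal sketch of this service style, under assumptions: `Event`, `EventRepository`, `InMemoryEventRepository` and `EventService` are hypothetical names invented for the example, not the talk's code. The service is a stateless trait of functions, and the repository is passed in at call time, so a test can hand it an in-memory map instead of a real data store.

```scala
import scala.collection.mutable

// Hypothetical domain type, for illustration only.
final case class Event(id: Int, payload: String)

// Hypothetical repository interface: the only stateful collaborator.
trait EventRepository {
  def put(ev: Event): Unit
  def get(id: Int): Option[Event]
}

// In-memory stand-in for a real store, handy in tests.
final class InMemoryEventRepository extends EventRepository {
  private val store = mutable.HashMap.empty[Int, Event]
  def put(ev: Event): Unit = store.update(ev.id, ev)
  def get(id: Int): Option[Event] = store.get(id)
}

// Stateless service: a pure validation function plus a save
// that receives its resource at invoke time instead of holding it.
trait EventService {
  def validate(ev: Event): Option[Event] =
    if (ev.id >= 0 && ev.payload.nonEmpty) Some(ev) else None

  def save(ev: Event, repo: EventRepository): Unit = repo.put(ev)
}

// One possible materialization of the trait; mixing it into another class works too.
object EventService extends EventService
```

Because `EventService` holds no state, nothing stateful has to be serialized when its functions are referenced from a closure; only the caller decides where the repository comes from.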
Example – Ingest Service
§ Deserialization, validation
§ Check configs (calls config service)
§ Annotate with customer metadata (loads partner DB)
§ Persist to HBase via Repository
trait IngestService {
  // Deserialize one payload into zero or more raw events
  def toRawEvents(bytes: Array[Byte]): Seq[RawEvent]

  // The config repository is passed in at invoke time rather than held as state
  def toValidEvent(ev: RawEvent, configRepo: ConfigRepository): Option[ValidEvent]

  // `Or` is presumably Scalactic's org.scalactic.Or (success type on the left, error on the right)
  def saveEvent(ev: ValidEvent, repo: EventRepository): Unit Or Throwable
}
Repositories and Other Stateful Objects
§ Repo – simple abstraction for modeling KV data stores, config DBs, etc.
  § Read-write or read-only
  § Simple interface makes it easy to mock in testing (e.g. HashMaps)
  § or to swap out the implementation (HBase, Cassandra, etc.)
§ Handled differently from simple services because
  § Generally relies on stateful objects (e.g. DB connection pool)
  § Needs an extra set-up and tear-down lifecycle
  § Each executor needs its own repo; how do you create it there?
https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/
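One way to approach the question above, sketched under assumptions: serialize only a factory function and build the resource lazily after deserialization. `LazyHolder`, `FakeConnection` and `SerializationRoundTrip` are hypothetical names for this example; the talk's `ExecutorSingleton` may work differently.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical non-serializable resource standing in for a DB connection pool.
final class FakeConnection(val url: String)

// Only the factory function is serialized. The resource itself is transient and
// is built lazily in whichever JVM first calls `get` on its copy of the holder.
final class LazyHolder[T](factory: () => T) extends Serializable {
  @transient private lazy val instance: T = factory()
  def get: T = instance
}

// Helper that mimics shipping an object to another JVM via Java serialization.
object SerializationRoundTrip {
  def apply[A <: AnyRef](value: A): A = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

A holder created as `new LazyHolder(() => new FakeConnection("db://example"))` survives the round trip even though `FakeConnection` is not serializable, and the copy builds its own connection on first `get`. Note that every deserialized copy creates its own instance, so sharing exactly one resource per executor JVM would need something extra, such as a JVM-wide cache.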
The Domain Model
§ Immutable entities via case classes
  § Serializable, equals and hash code, pattern matching out of the box
§ Controlled creation via smart constructors (factory + validation)
  § Enforce invariants during creation and transformation
§ No more defensive checks everywhere
  § Domain objects are guaranteed to be valid
  § Leverages the type system and compiler
http://www.cakesolutions.net/teamblogs/enforcing-invariants-in-scala-datatypes
Example – DataSource
§ Validations done during creation & transformation phases
§ Immutable object; can’t change after that!
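sealed trait DataSource {
  def id: Int
}

case object GlobalDataSource extends DataSource {
  val id = 0
}

// Abstract, so the compiler generates no apply: instances only come from DataSource.apply
sealed abstract case class ExternalDataSource(id: Int) extends DataSource

object DataSource {
  // Smart constructor: rejects negative ids and maps 0 to the global data source
  def apply(id: Int): Option[DataSource] = id match {
    case invalid if invalid < 0 => None
    case GlobalDataSource.id    => Some(GlobalDataSource)
    // An abstract class must be instantiated through an anonymous subclass
    case anyDsId                => Some(new ExternalDataSource(anyDsId) {})
  }
}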
Other Tips and Tricks
§ Typesafe config + ficus for powerful, zero boilerplate app config
https://github.com/iheartradio/ficus
§ Option / Try / Either for error handling
http://longcao.org/2015/07/09/functional-error-accumulation-in-scala
§ Unit / integration testing for Spark apps
https://github.com/holdenk/spark-testing-base
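As a small illustration of the error-handling tip, the sketch below parses and validates a raw record with `Try` and `Either`, so failures become values instead of exceptions. `RawRecord`, `Customer`, `RecordParser` and the error messages are hypothetical names chosen for the example. It stops at the first failure rather than accumulating all errors.

```scala
import scala.util.Try

// Hypothetical input and output types, for illustration only.
final case class RawRecord(customerId: String, amount: String)
final case class Customer(id: Int, amount: BigDecimal)

object RecordParser {
  // Try captures the NumberFormatException; toRight turns a missing value into a Left.
  private def parseId(raw: String): Either[String, Int] =
    Try(raw.trim.toInt).toOption.filter(_ >= 0).toRight(s"invalid customer id: '$raw'")

  private def parseAmount(raw: String): Either[String, BigDecimal] =
    Try(BigDecimal(raw.trim)).toOption.toRight(s"invalid amount: '$raw'")

  // Fail-fast composition; relies on right-biased Either (Scala 2.12 or later).
  def parse(rec: RawRecord): Either[String, Customer] =
    for {
      id     <- parseId(rec.customerId)
      amount <- parseAmount(rec.amount)
    } yield Customer(id, amount)
}
```

Callers pattern match on the result, for example routing `Left` values to a dead-letter output and passing `Right` values downstream.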
Conclusion – Reaching Our Design Goals
§ Scalable
§ Maintainable
  § Clear concerns at app level
  § Modular code
  § Pure functions
  § Immutable data structures
§ Testable
  § Pure functions are easy to unit test
  § The App interface makes integration tests easy
§ Easily configurable
  § Config is a statically typed class hierarchy
  § Free to parse via typesafe-config / ficus
§ Portable
  § Only the app “speaks” Spark
  § Business logic and domain model can be swapped out easily
Use FP in the small, OOP in the large!
Let’s Keep in Touch!
§ Adrian Tanase
atanase@adobe.com
§ We’re hiring!
http://bit.ly/adro-careers
  • 1. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Spark on Scala – Reference Architecture Adrian Tanase – Adobe Romania, Analytics
  • 2. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Agenda § Building data processing apps with Scala and Spark § Our reference architecture § Goals § Abstractions § Techniques § Tips and tricks
  • 3. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. What is Spark? 3 § General engine for large scale data processing w/ APIs in Java, Scala and Python § Batch, Streaming, Interactive § Multiple Spark apps running concurrently in the same cluster
  • 4. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Our Requirements for Spark Apps § Build many data processing applications, mostly ETL and analytics § Batch and streaming ingestion and processing § Stateless and stateful aggregations § Consume data from Kafka, persist to HBase, HDFS and Kafka § Interact (real time) with external services (S3, REST via http) § Deployed on Mesos/Docker across AWS and Azure 4
  • 5. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Real Life With Spark § Generic data-processing (analytics, SQL, Graph, ML) § BUT not generic distributed computing § Lacks API support for things like § Lifecycle events around starting / creating executors § e.g. instantiate a DB connection pool on remote executor § Sending shared data to all executors and refresh it a certain intervals § e.g. shared config that updates dynamically and stays in sync across all nodes § Async processing of events § e.g. HTTP non-blocking calls on the hot path § Control flow in case of bad things happening on remote nodes § e.g. pause processing or controlled shutdown if one node can’t reach an external service 5
  • 6. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Our Reference Architecture § Basic template for building spark/scala apps in our team § Take advantage of Spark strong points, work around limitations § Decouple Spark APIs and business logic § Leverage strong points in Scala (blend FP and OOP) § Design goals – all apps should be: § Scalable (horizontally) § Reliable (at least once processing, no data loss) § Maintainable (easy to understand, change, remove code) § Testable (easy to write unit and integration tests) § Easy to configure (deploy time) § Portable (to other processing frameworks like akka or kafka streams) 6
  • 7. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. The Sample App § Ingest – first component in the stack § Use case – basic ETL § load from persistent queue (Kafka) § unpack and validate protobuf elements § reach out to external config service § e.g. is customer active? § add minimal metadata (lookups to customer DB) § persist to data store (HBase) § emit for downstream processing (Kafka) 7
  • 8. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Abstractions Used 8 Main entry point Application Services e.g. Validation (internal) e.g. Configuration (http) Stateful Resources Repository Message Producer Spark APIs Config Domain model
  • 9. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Main Entrypoint § Load / parse configuration § Instantiate SparkContext, DB connections, etc § Starts data processing (the application) by providing concrete instances for all deps 9 object IngestMain { def main(args:  Array[String])  { val config =  IngestConfig.loadConfig val streamContext =  new StreamingContext(...) val ingestApp =  getIngestApp(config) val ingressStream =  KafkaConnectionUtils.getDStream(...) ingestApp.process(ingressStream) streamContext.start() streamContext.awaitTermination() } } Main entry point Application Services e.g. Validation (internal) e.g. Configuration (http) Stateful Resources Repository Message Producer Spark APIs Config Domain model
  • 10. © 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. The Application § Assembles services / repos into actual data processing app § Facilitates integration testing by not relying on actual kafka queues, hbase connections, etc § Only place in the code that "speaks" Spark (DStream, RDD, transform APIs, etc) § Change this file to port app to another streaming framework 10 Main entry point Application Services e.g. Validation (internal) e.g. Configuration (http) Stateful Resources Repository Message Producer Spark APIs Config Domain model trait IngestApp { def ingestService:  IngestService def eventRepo:  ExecutorSingleton[EventRepository] def process(dstream:  DStream[Array[Byte]]):  Unit =  { val rawEvents =  dstream.mapPartitions {  partition  => partition.flatMap(ingestService.toRawEvents(...)) } processEvents(rawEvents) } }
  • 11. The Application (2)
      § Deals with Spark complexities so that the business services don’t have to
          § Caching, progress checkpointing, controlling side effects
          § Shipping code and stateful objects (e.g. DB connection) to executors

      def processEvents(events: DStream[RawEvent]): Unit = {
        val validEvents = events.transform { rdd =>
          // update and broadcast global config
          rdd.flatMap { event =>
            ingestService.toValidEvent(event, ...)
          }
        }
        // Cache so the validated stream is not recomputed by later actions
        validEvents.cache()
        // foreachRDDOrStop is not part of the stock DStream API;
        // presumably an in-house extension of foreachRDD
        validEvents.foreachRDDOrStop { rdd =>
          rdd.foreachPartition { partition =>
            // Obtain the executor-local repository once per partition
            val repo = eventRepo.get
            partition.foreach { ev => ingestService.saveEvent(ev, repo) }
          }
        }
      }
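The slides use `ExecutorSingleton` but do not show its implementation. The following is a minimal sketch of one common way to write such a wrapper, assuming a serializable factory and a `@transient lazy val`; it is a guess at the idea, not the authors' code. `FakeConnectionPool` and `ExecutorSingletonDemo` are hypothetical names used only to exercise it.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch: only the factory travels with the serialized object.
final class ExecutorSingleton[T](factory: () => T) extends Serializable {
  // @transient: the instance itself is never serialized.
  // lazy: it is built on first use, in whichever JVM calls get.
  @transient private lazy val instance: T = factory()
  def get: T = instance
}

// Stand-in for a non-serializable resource such as a DB connection pool.
final class FakeConnectionPool(val id: Int)

object ExecutorSingletonDemo {
  // Counts how many times the factory has actually run.
  val created = new AtomicInteger(0)
  val pool = new ExecutorSingleton(() => new FakeConnectionPool(created.incrementAndGet()))
}
```

Calling `get` repeatedly on the same wrapper returns the same instance. Note that each deserialized copy of the wrapper builds its own instance on first use, so how many instances end up on an executor depends on how the wrapper is shipped (for example, inside a broadcast variable versus inside each task closure), which the slides do not specify.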
  • 12. Services
      § Represent the majority of business logic
      § Stateless and generally implemented as Scala traits
          § Collection of pure functions grouped logically
          § Process immutable data structures, side effects are contained
      § All resources provided at invoke time, avoiding DI altogether
          § Avoids serialization issues of stateful resources (e.g. DB connection), concerns which are pushed to the outer application layers
      § Actual materialization of the trait can be deferred
          § e.g. object, service class, mix-in into another class
          § Allows for a very modular architecture
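As a rough illustration of the points above, here is a small sketch of a stateless service written as a trait, with its resource passed in at invoke time and two ways of materializing it. All names (`Event`, `CustomerLookup`, `ValidationService`, `DefaultValidationService`, `IngestPipeline`) are hypothetical and not taken from the slides.

```scala
// Hypothetical immutable event type.
final case class Event(customerId: Int, payload: String)

// Read-only resource, handed to the service by the caller on each call.
trait CustomerLookup {
  def isActive(customerId: Int): Boolean
}

// The service: no fields, no injected resources, only functions of its arguments.
trait ValidationService {
  def validate(ev: Event, customers: CustomerLookup): Option[Event] =
    if (ev.payload.nonEmpty && customers.isActive(ev.customerId)) Some(ev) else None
}

// Materialization 1: a standalone object.
object DefaultValidationService extends ValidationService

// Materialization 2: mixed into another class.
final class IngestPipeline(customers: CustomerLookup) extends ValidationService {
  def run(events: Seq[Event]): Seq[Event] =
    events.flatMap(ev => validate(ev, customers))
}
```

Because the trait holds no state, nothing stateful needs to be serialized with it; the lookup is supplied by whichever outer layer owns it.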
  • 13. Example – Ingest Service
      § Deserialization, validation
      § Check configs (calls config service)
      § Annotate with customer metadata (loads partner DB)
      § Persist to HBase via Repository

      trait IngestService {
        // Deserialize one queue message into zero or more raw events
        def toRawEvents(bytes: Array[Byte]): Seq[RawEvent]

        // Validate against config; None means the event is dropped
        def toValidEvent(
          ev: RawEvent,
          configRepo: ConfigRepository): Option[ValidEvent]

        // `Or` is presumably Scalactic's org.scalactic.Or (a result or an error)
        def saveEvent(
          ev: ValidEvent,
          repo: EventRepository): Unit Or Throwable
      }
  • 14. Repositories and Other Stateful Objects
      § Repo – simple abstraction for modeling KV data stores, config DBs, etc.
          § Read-write or read-only
          § Simple interface makes it easy to mock in testing (e.g. HashMaps)
          § or to swap out the implementation (HBase, Cassandra, etc.)
      § Handled differently from simple services because
          § Generally relies on stateful objects (e.g. DB connection pool)
          § Needs an extra set-up and tear-down lifecycle
          § Each executor needs its own repo; how do you create it there?
            https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/
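The "easy to mock with HashMaps" point can be sketched as follows. The `Repository` interface, `InMemoryRepository` and `CustomerNames` are hypothetical names chosen for illustration; the slides do not show the actual repository interface.

```scala
import scala.collection.mutable

// Hypothetical key-value repository interface.
trait Repository[K, V] {
  def get(key: K): Option[V]
  def put(key: K, value: V): Unit
}

// In-memory implementation backed by a HashMap, intended for tests.
final class InMemoryRepository[K, V] extends Repository[K, V] {
  private val store = mutable.HashMap.empty[K, V]
  def get(key: K): Option[V] = store.get(key)
  def put(key: K, value: V): Unit = store.update(key, value)
}

// Business code depends only on the interface, so an HBase-backed or
// Cassandra-backed implementation could be swapped in without changes here.
object CustomerNames {
  // Stores the new name and returns the previous one, if any.
  def rename(repo: Repository[Int, String], id: Int, name: String): Option[String] = {
    val previous = repo.get(id)
    repo.put(id, name)
    previous
  }
}
```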
  • 15. The Domain Model
      § Immutable entities via case classes
          § Serializable, equals and hash code, pattern matching out of the box
      § Controlled creation via smart constructors (factory + validation)
          § Enforce invariants during creation and transformation
      § No more defensive checks everywhere
          § Domain objects are guaranteed to be valid
          § Leverages the type system and compiler
            http://www.cakesolutions.net/teamblogs/enforcing-invariants-in-scala-datatypes
  • 16. Example – DataSource
      § Validations done during creation & transformation phases
      § Immutable object; can’t change after that!

      sealed trait DataSource {
        def id: Int
      }

      case object GlobalDataSource extends DataSource {
        val id = 0
      }

      // Abstract case class: no public apply or copy is generated,
      // so instances can only come from the smart constructor below
      sealed abstract case class ExternalDataSource(id: Int) extends DataSource

      object DataSource {
        // Smart constructor: validates the id before creating an instance
        def apply(id: Int): Option[DataSource] = id match {
          case invalid if invalid < 0 => None
          case GlobalDataSource.id    => Some(GlobalDataSource)
          // An abstract class must be instantiated via an anonymous subclass
          case anyDsId                => Some(new ExternalDataSource(anyDsId) {})
        }
      }
  • 17. Other Tips and Tricks
      § Typesafe config + ficus for powerful, zero-boilerplate app config
        https://github.com/iheartradio/ficus
      § Option / Try / Either for error handling
        http://longcao.org/2015/07/09/functional-error-accumulation-in-scala
      § Unit / integration testing for Spark apps
        https://github.com/holdenk/spark-testing-base
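To make the Option / Try / Either tip concrete, here is a small sketch that chains two validation steps with the standard library's right-biased `Either` (Scala 2.12 or later). The error hierarchy, `ParsedEvent` and `IngestValidation` are hypothetical. This version fails fast on the first error; it does not accumulate errors the way the linked article describes.

```scala
import scala.util.Try

// Hypothetical error type for the ingest step.
sealed trait IngestError
final case class Malformed(reason: String) extends IngestError
final case class InactiveCustomer(id: Int) extends IngestError

// Hypothetical parsed event.
final case class ParsedEvent(customerId: Int, payload: String)

object IngestValidation {
  // Parses "id,payload". Try catches the NumberFormatException from toInt
  // and the failure is converted into a domain error.
  def parse(line: String): Either[IngestError, ParsedEvent] =
    line.split(",", 2) match {
      case Array(id, payload) =>
        Try(id.trim.toInt).toOption
          .toRight(Malformed(s"bad customer id: $id"))
          .map(ParsedEvent(_, payload))
      case _ => Left(Malformed("expected 'id,payload'"))
    }

  // Rejects events from customers that are not in the active set.
  def checkActive(active: Set[Int])(ev: ParsedEvent): Either[IngestError, ParsedEvent] =
    if (active(ev.customerId)) Right(ev) else Left(InactiveCustomer(ev.customerId))

  // The for-comprehension stops at the first Left.
  def validate(active: Set[Int])(line: String): Either[IngestError, ParsedEvent] =
    for {
      parsed <- parse(line)
      ok     <- checkActive(active)(parsed)
    } yield ok
}
```

Callers can pattern match on the result to decide whether to persist the event, log the error, or route it to a dead-letter destination.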
  • 18. Conclusion – Reaching Our Design Goals
      Design goals:
      § Scalable
      § Maintainable
      § Testable
      § Easily configurable
      § Portable
      Supporting points:
      § Only the app “speaks” Spark
      § Business logic and domain model can be swapped out easily
      § Config is a statically typed class hierarchy
      § Free to parse via typesafe-config / ficus
      § Clear concerns at app level
      § Modular code
      § Pure functions
      § Immutable data structures
      § Pure functions are easy to unit test
      § The App interface makes integration tests easy
      Use FP in the small, OOP in the large!
  • 19. Let’s Keep in Touch!
      § Adrian Tanase atanase@adobe.com
      § We’re hiring! http://bit.ly/adro-careers