The presentation covers lambda architecture and implementation with spark. In the presentation we will discuss about components of lambda architecture like batch layer, speed layer and serving layer. We will also discuss its advantages and benefits with spark.
2. AgendaAgenda
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
3. AgendaAgenda
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
4. “ Lambda architecture is a data-processing architecture designed to
handle massive quantities of data by taking advantage of both batch-
and stream-processing methods.”
wikipedia
“ Lambda architecture is a data-processing architecture designed to
handle massive quantities of data by taking advantage of both batch-
and stream-processing methods.”
wikipedia
What is Lambda Architecture ?What is Lambda Architecture ?
Coined by Nathan marz
➢ Ex- Twitter Engineer
➢ Creator of Apache Storm
5. AgendaAgenda
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
6. Components of Lambda ArchitectureComponents of Lambda Architecture
Lambda architecture broadly classified into three layer :-
➢ Batch Layer
➢ Speed Layer
➢ Serving Layer
7. Overview of Lambda ArchitectureOverview of Lambda Architecture
https://www.mapr.com/developercentral/lambda-architecture
8. Batch LayerBatch Layer
In the Lambda Architecture, the batch layer precomputes the
master dataset into batch views so that queries can be resolved
with low latency.
9. Master DataSetMaster DataSet
The master dataset is the source of truth in the Lambda Archi-
tecture. Even if you were to lose all your serving layer datasets
and speed layer datasets, you could reconstruct your
application from the master dataset.
Data in master dataset must hold three properties :-
➢ Data is raw
➢ Data is immutable
➢ Data is eternally true
10. Computing functions on the batch layerComputing functions on the batch layer
As our master dataset is continually growing, we must have a
strategy for updating our batch views when new data becomes
available.
Here we have two suitable computing algorithm :-
➢ Recomputation algorithms : Throwing away the old batch views
and recomputing functions over the entire master dataset.
➢ Incremental algorithms : An incremental algorithm will update the
views directly when new data arrives.
11. Speed LayerSpeed Layer
There are two major facets of the speed layer: storing the
realtime views and processing the incoming data stream so as
to update those views.
12. Storing real time viewsStoring real time views
The underlying storage layer must meet the following requirements: -
Random reads : A realtime view should support fast random reads to
answer queries quickly.
Random writes : To support incremental algorithms, it must also be
possible to modify a realtime view with low latency.
Scalability : As with the serving layer views, the realtime views should
scale with the amount of data they store and the read/write rates required
by the application.
Fault tolerance : If a disk or a machine crashes, a realtime view should
continue to function normally.
13. Serving LayerServing Layer
In the Lambda Architecture, the serving layer provides low-latency
access to the results of calculations performed on the master
dataset. The serving layer views are slightly out of date due to the
time required for batch computation.
14. Requirements for a serving layer databaseRequirements for a serving layer database
Similar to speed layer these are following requirements: -
Random reads : A serving layer database must support random reads,
with indexes providing direct access to small portions of the view.
Batch writable : The batch views for a serving layer are produced from
scratch. When a new version of a view becomes available, it must be
possible to completely swap out the older version with the updated view.
Scalability : A serving layer database must be capable of handling views
of arbitrary size.
Fault tolerance : Because a serving layer database is distributed, it must
be tolerant of machine failures.
15. AgendaAgenda
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
16. Advantages of Lambda ArchitectureAdvantages of Lambda Architecture
These are following advantages of lambda architecture: -
Human fault tolerance : LA is provides human fault tolerance capability
to the Big data system.
Operational complexity : It resolved operational complexity issue of big
historical query by divide into precomputed query and on fly query.
Resilience : LA is fully resilience,because it is difficult for human errors or
hardware faults to corrupt data stored in the system since the system does
not allow update or delete operations in existing data.
Simple & Maintainable : It is simple in nature so we can easily
understand and it’s flexible architecture is helpful in maintainance.
17. AgendaAgenda
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
● What is Lambda Architecture ?
● Components of Lambda Architecture
● Advantages of Lambda Architecture
● Implementation with Spark and it's Benefits
● Code Review & Demo
18. Implementation with Spark and it's BenefitsImplementation with Spark and it's Benefits
There are following benefits to implement LA with Spark : -
➢ Spark gave us unified stack like Spark Core,Spark SQL,Spark
Streaming,Mllib, and GraphX, so that we can easily implement LA.
➢ Spark has clean and easy-to-use APIs (far more readable and with
less boilerplate code than MapReduce).
➢ Biggest advantage Spark gave us in this case is Spark Streaming,
which allowed us to re-use the same aggregates we wrote for our
batch application on a real-time data stream.
19. ReferencesReferences
Big Data Principles and best practices of scalable real-time data systems
Nathan Marz WITH James Warren
https://en.wikipedia.org/wiki/Lambda_architecture
https://www.mapr.com/developercentral/lambda-architecture
http://lambda-architecture.net/
1The batch layer runs functions over the master dataset to precompute intermediate data called batch views.
2.The speed layer compensates for the high latency of the batch layer by providing low-latency updates using data that has yet to be precomputed into a batch view.
3.Queries are then satisfied by processing data from the serving layer views and the speed layer views, and merging the results.
1
1.The batch layer runs functions over the master dataset to precompute intermediate data called batch views.
2. Batch Layer has three component:
1Master data set:- which is immuatble and append only data set.
2 precomputing function : it is generally a map reduce function which operates on master data set and produce batch view.precomputing functions are use for high latency query like historical queries.
3 Batch View: It is a outcome of precomputed function
1.The master dataset is the only part of the Lambda Architecture that absolutely must be safeguarded from corruption.
2.Data is raw : When designing your Big Data system, you want to be able to answer as many questions as possible. To do so we need to store raw data in master dataset because if we store normalized data then we have to lose many facts of data but again it depends on use case ,what level of rawness of data we require .
3 Data is immutable : In immutability we can not update or delete data ,we can only append data in dataset.
There are some vital advantages of it:
a)Human-fault tolerance: if by mistake we added any bad data in dataset and after some time found this is bad we just remove this bad data and recompute on master data set.
b)Simplicity:Immutable dataset is simple because it doesn’t required to store indexes like for mutable data.
4Data is eternally true: The key consequence of immutability is that each piece of data is true in perpetuity.That is, a piece of data, once true, must always be true. Immutability wouldn’t make sense without this property.
1.Performance:
a)RA:Requires computational effort to process the
entire master dataset.
b)IA:Requires less computational resources but may generate much larger batch views.
2.Human-fault tolerance:
a)RA:Extremely tolerant of human errors because
the batch views are continually rebuilt.
b)IA:Doesn’t facilitate repairing errors in the batch views; repairs are ad hoc and may require estimates.
3.Generality :
a)RA: Complexity of the algorithm is addressed during precomputation, resulting in simple batch
views and low-latency, on-the-fly processing.
b)IA:Requires special tailoring; may shift complexity to on-the-fly query processing.
4.Conclusion: So conclusion of both algorithm is.
a)RA:Essential to supporting a robust data-processing system.
b)IA:Can increase the efficiency of your system, but only as a supplement to recomputation algorithms.
1. As we know that the power of the Lambda Architecture lies in the separation of roles in the different layers.
2.Speed layer is one of core layer in this architecture.It fills the delta gap that is left by batch layer.that means combine speed layer view and batch view give us capibility fire any adhoc query on all data that is query=function(over all data).
3.We can also set Expiring realtime views example in memcache we set expiring time for key/value pairs.
1. Random reads: This means the data it contains must be indexed.
2 scalability: Typically this implies that realtime views can be distributed across many machines. Now days sharding technique is widely use to meet scalability requirement in database.
3. Fault tolerance : Fault tolerance is accomplished by replicating data across machines so there are backups should a single machine fail.
1. But this is not a concern, because the speed layer will be responsible for any data not yet available in the serving layer.
2.We can also write on fly query function in serving layer to give low latency query result.
1.Human fault tolerance:
a)If bug in batch job:Discard batch view and recompute it.
b) If bug in master data then just discard buggy data and re-process on old data.master dataset is immutable and append only dataset so we can easily discard buggy data.
c)If bug in query then re-deploy query layer.
2. In lambda Architecture we can use different alogorithm in each layar, like in batch layer we use Exact seach algorthm and in speed layer we can use Approximate seach algo.
3)Under the Lambda Architecture, results from the batch layer are always eventually consistent. As soon as a fresh batch update is completed, results from the batch layer are consistent.
1.Spark also provides in built common ML algorithms such as classification, regression, clustering, and collaborative filtering.
3.We didn’t need to re-implement the business logic, nor test and maintain a second code base.
4.