This talk centers on two things: a set of patterns for the architecture of high-scale data systems; and a framework for understanding the tradeoffs we make in designing them.
Patterns of the Lambda Architecture -- 2015 April -- Hadoop Summit, Europe
1. Patterns of the Lambda Architecture
Truth and Lies at the Edge of Scale
Flip Kromer — CSC
I’m Flip Kromer, Distinguished Engineer at CSC. If you are a large enterprise looking to add Big Data capabilities — especially where legacy systems are involved — we’re a big, stable company that specializes in turning technology into an enterprise-grade solution.
2. Pattern Set
This talk will equip you with two things.
One is a set of patterns for how we design high-scale architectures to solve specific solution cases, now that extra infrastructure is nearly free.
5. Search w/ Update
[Diagram: a ton of historical text feeds a batch "Build Indexes" job producing a Historical Index; more text arrives continuously into a Live Indexer producing a Recent Index; an API serves both.]
In this system, we have a whole ton of historical text, with more arriving all the time, and we want to allow immediate real-time search across the whole corpus.
6. Search w/ Update
[Diagram, as before, highlighting the batch "Build Main Index" step.]
We will use a large periodic batch job to create indexes on the historical data. This takes a while — far longer than our recency demands allow — so we might as well have our elephants use clever algorithms and optimally organize the data for rapid retrieval.
7. Search w/ Update
[Diagram, as before, highlighting the "Update Recent Index" step.]
Until the next stampede arrives with an updated index, as each new record arrives we will not only file it with the historical data but also use simple fast indexing to make it immediately searchable. Merging new records directly would require stuffing them into the right place in the historical index, which eventually means moving records around, which demands far too much time and complexity to be workable.
8. Search w/ Update
[Diagram, as before, highlighting the "Serve Result" step.]
The system to serve the data just pulls from both indexes in real time.
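The serving step can be sketched as a merge across the two indexes. This is a toy in-memory version (the index structures and names here are illustrative, not any particular search engine's API):

```python
# Sketch: the serving layer queries both the batch-built and live indexes and
# merges results, so new documents are searchable before the next batch run.
def search(query, historical_index, recent_index, limit=10):
    """Return the top matches across both indexes for a query term."""
    hits = historical_index.get(query, []) + recent_index.get(query, [])
    # Rank by score; dedupe in case a document appears in both indexes.
    seen, merged = set(), []
    for doc_id, score in sorted(hits, key=lambda h: -h[1]):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append((doc_id, score))
    return merged[:limit]

historical = {"lambda": [("doc1", 0.9), ("doc2", 0.7)]}
recent = {"lambda": [("doc9", 0.8), ("doc1", 0.9)]}
print(search("lambda", historical, recent))
# [('doc1', 0.9), ('doc9', 0.8), ('doc2', 0.7)]
```

The reader never needs to know which index a hit came from; that's what makes the batch/live split invisible at the API.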
9. Lambda Architecture
[The same diagram, its parts now labeled as the three layers: Batch, Speed, Serving.]
We have a batch layer for the global corpus, a speed layer for recent results, and a serving layer for access.
10. Lambda Architecture
[The same diagram, the three layers now labeled by what they provide: Global, Relevant, Immediate.]
The batch layer gives global truth; the speed layer, relevant recent results; the serving layer, immediate access.
16. Lambda Arch Layers
• Batch layer -- Deep Global Truth -- throughput
• Speed layer -- Relevant Local Truth -- throughput
• Serving layer -- Rapid Retrieval -- latency
The batch and speed layers care about throughput; the serving layer cares about latency.
20. Lambda Architecture
λ(v)
• Pure Function on immutable data
But really it means this new mindset: building a pure function (lambda) on immutable data.
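A minimal sketch of that mindset: the view is a pure function of an immutable, append-only master dataset, so recomputing from scratch always yields the same answer (the event log here is made up for illustration):

```python
# Sketch: views as pure functions of an immutable log. Re-running the
# function over the same log always produces the same view.
events = [("page_view", "/home"), ("page_view", "/about"), ("page_view", "/home")]

def page_counts(all_events):
    """Pure function: same input log, same output view. No mutation of inputs."""
    counts = {}
    for kind, page in all_events:
        if kind == "page_view":
            counts[page] = counts.get(page, 0) + 1
    return counts

print(page_counts(events))  # {'/home': 2, '/about': 1}
```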
22. Ideal Data System
• Capacity -- Can process arbitrarily large amounts of data
• Affordability -- Cheap to run
• Simplicity -- Easy to build, maintain, debug
• Resilience -- Jobs/Processes fail&restart gracefully
• Responsiveness -- Low latency for delivering results
• Justification -- Incorporates all relevant data into result
• Comprehensive -- Answer questions about any subject
• Recency -- Promptly incorporates changes in world
• Accuracy -- Few approximations or avoidable errors
The laziest, and therefore best, knobs are the Capacity/Affordability ones. The pre-big-data era can be thought of as one where only those two exist. Big Data broke the handle off the Capacity knob, either because Affordability ramps too fast or because the speed of light starts threatening resilience, responsiveness or recency.
* _Comprehensive_: complete; including all or nearly all elements or aspects of something
* _Concise_: giving a lot of information clearly and in a few words; brief but comprehensive
23. Ideal Data System
You would think that what mattered was correctness — justified true belief.
24. Ideal Data System
When you look at what we actually do, the non-negotiables are that it be manageable and economical, given that you must process arbitrarily large amounts of data.
Truth is a nice-to-have.
26. At Scale
THIS AND THIS, AND TRY TO BE GOOD
Basically, given big data you have to accommodate any amount of data and produce static reports or queries that execute within the duration of human patience — so you must be fast and cheap, sacrificing good.
29. Pattern: Train / React
• Model of the world lets you make immediate decisions
• World changes slowly, so we can re-build model at leisure
• Relax: Recency
• Batch layer: Train a machine learning model
• Speed layer: Apply that model
• Examples: most Machine Learning thingies
(Recommender)
Big fat job that only needs to run occasionally; results of the job inform what happens
immediately
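Train / React can be sketched with a toy "model" (a popularity table standing in for a real recommender; the data and names are made up for illustration):

```python
# Sketch of Train / React: the batch layer retrains occasionally over all
# history; the speed layer applies the frozen model to make immediate calls.
def train(purchase_log):
    """Batch layer: big, occasional job over the full history."""
    popularity = {}
    for user, item in purchase_log:
        popularity[item] = popularity.get(item, 0) + 1
    return popularity

def recommend(model, exclude, k=2):
    """Speed layer: immediate decision using the last trained model."""
    ranked = sorted(model, key=lambda item: -model[item])
    return [item for item in ranked if item not in exclude][:k]

model = train([("a", "book"), ("b", "book"), ("b", "lamp"), ("c", "mug")])
print(recommend(model, exclude={"mug"}))  # ['book', 'lamp']
```

Notice the relaxation: recommendations served right now can lag behind the newest purchases, because the world changes slowly enough that retraining at leisure is fine.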
31. Pattern: Baseline / Delta
• Understanding the world takes a long time
• World changes much faster than that, and you care
• Relax: Simplicity, Accuracy
• Batch layer: Process the entire world
• Speed layer: Handle any changes since last big run
• Examples: Real-time Search index; Count Distinct;
other Approximate Stream Algorithms
In Train / React, the world changes, but slowly; training in batch mode is just fine.
In Baseline / Delta, the world changes so quickly that you can't run the compute job fast enough.
So you are sacrificing simplicity — there are two systems where there was only one — and accuracy — the recent records won't update global normalized frequencies.
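A minimal sketch of Baseline / Delta for count-distinct (toy data; a real system might hold the baseline as an approximate structure like a Bloom filter, which is where the accuracy sacrifice comes in):

```python
# Sketch of Baseline / Delta: the batch job computes an exact baseline;
# the speed layer tracks only items seen since that run. The answer is the
# baseline count plus the delta count, refreshed when the next batch lands.
baseline = {"alice", "bob", "carol"}   # exact distinct set from the last batch run
delta = set()                          # cheap in-memory set, reset after each batch run

def observe(user):
    """Speed layer: record a user only if the baseline hasn't seen them."""
    if user not in baseline:
        delta.add(user)

for u in ["alice", "dave", "dave", "erin"]:
    observe(u)

print(len(baseline) + len(delta))  # 5 distinct users
```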
33. Pagerank
[Diagram: a graph of nodes, each labeled with its pagerank-style score (48, 42, 24, 12, 6, …).]
This next example has an importantly different flavor.
The core way that Google identifies important web pages is the “Pagerank” algorithm, which basically says “a page is interesting if other interesting pages link to it”. That’s recursive of course but the math works out. You can do similar things on a social network like Twitter to find spammers and superstars, or among college football teams or World of Warcraft players to prepare a competitive ranking, or among buyers and sellers in a market to detect fraud.
To define a reputation ranking on, say, Twitter you simulate a game of multiple rounds.
34. New Record Appears
[The same scored graph; a new node appears with unknown score (?), connected to existing nodes.]
Doing this is kinda literally what Hadoop was born to do, and it’s a simple Hadoop-101 level program.
Acting out all those rounds using every interaction we’ve ever seen takes a fair amount of time, though, and so a problem comes when we meet a new person. This new person accrues some reputational jellybeans, and we don’t want to live in total ignorance of what their score is; and they dispatch some as well, which should change the scores of those they follow.
35. Update Using Local
[The same graph; the new node’s score is guessed from its immediate neighbors: 12÷3 = 4, 24÷5 ≈ 5, giving 9.]
Well, we can roughly guess the score of the new node by having their followers pay out a jellybean share proportional to what they would have gotten in the last pagerank round.
“A Guess beats a Blank Stare”
* World rate of change not really relevant
* The solution is actually to tell a lie
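The local white lie from the slide is just a one-line sum over the immediate neighborhood:

```python
# Sketch: estimate a newcomer's score from the immediate neighborhood only,
# as if each follower paid out one round's share of its last batch score.
def guess_score(followers):
    """followers: list of (score, out_degree) pairs from the last batch run."""
    return round(sum(score / out_degree for score, out_degree in followers))

# The slide's example: followers scored 12 (3 outbound links) and 24 (5 links).
print(guess_score([(12, 3), (24, 5)]))  # 12/3 + 24/5 = 8.8, rounds to 9
```

No global graph traversal, no propagation rounds: just the neighbors you can see from the new record.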
36. …Ignoring Correctness
[The same graph; the neighbors’ scores are left stale: meh.]
But we’re not going to update the neighbors. You’d be concurrently updating an arbitrary number of outbound nodes, and then of course those nodes’ changes should rightfully propagate as well — this is why we play the multiple pagerank rounds in the first place.
What we do instead is lie. Look, planes don’t fall out of the sky if you get someone’s coolness quotient wrong in the first decimal point.
37. Batch Updates Graph
[The graph after the next batch run: every node’s score recomputed globally (42, 36, 30, 21, …).]
(A Guess beats a Blank)
This has an importantly different flavor.
* World rate of change not really relevant
* The solution is actually to tell a lie
38. Pattern: World/Local
• Understanding the world needs full graph
• You can tell a little white lie reading immediate graph only
• Relaxing: Accuracy, Justification
• Batch layer: uses global graph information
• Speed layer: just reads immediate neighborhood
• Examples: “Whom to follow”, Clustering, anything at 2nd-degree (friend-of-a-friend)
Problem isn’t so much about the volume of data,
it’s about how _far away_ that data is.
You can’t justify doing that second-order query for three reasons:
* time
* compute resources
* computational risk
39. Pattern: Guess Beats Blank
• You can’t calculate a good answer quickly
• But Comprehensiveness is a must
• Relaxing: Accuracy, Justification
• Batch layer: finds the correct answer
• Speed layer: makes a reasonable guess
• Examples: Any time the sparse-data case is also the most valuable
In this case, we can’t sacrifice comprehensiveness — for every record that exists, we
must return a relevant answer. So we sacrifice truthfulness — or more precisely, we
sacrifice accuracy and justification.
40. Marine Corps’ 80% Rule
“Any decision made with more than 80% of the necessary information is hesitation”
— “The Marine Corps Way”, Santamaria & Martino
When there’s lots of data already, the imperfect result in the speed layer doesn’t have a huge effect.
When there isn’t much data, it’s overwhelmingly better to fill in with an imperfect result.
41. A Guess Beats a Blank
43. Pattern: Slow Boil/Flash Fire
• Two tempos of data: months vs milliseconds
• Short-term data too much to store
• Long-term data too much to handle immediately
• Often accompanies Baseline / Deltas, Global / Local
• Examples:
• Trending Topics
• Insider Security
Global/Local: Why has a contractor sysadmin in Hawaii accessed powerpoint presos from every single group within our organization?
46. Pattern: C-A-P Tradeoffs
• C-A-P tradeoffs:
• Can’t depend on when data will roll in (Justification)
• Can’t live in ignorance (Comprehensiveness)
• Batch layer: The final answer
• Speed layer: Actionable views
• Examples: Security (Authorization vs Auditing), lots of counting problems (Banking)
47. Pattern: Out-of-Order
48. Common Theme
The System Asymptotes to Truth over time
We keep seeing this common theme — you are building a system that approaches correctness over time. This leads to a best practice that I’ll call the improver pattern:
50. Entity Resolution
• Scrapers: yield partial records
• Unifier: connects all identifiers for a common object
• Resolver: combines partial records into unified record
51. Pattern: Improver
• Improver:
function(best guess,{new facts}) ~> new best guess
• Batch layer: f(blank, {all facts}) ~> best possible guess
• Speed layer: f(current best, {new fact}) ~> new best guess
• Batch and speed layer share same code & contract,
asymptote to truth.
The way you build your resolver is such that the same function serves both the batch and speed layers.
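The improver contract can be sketched as a single pure fold. This is a toy record-merging resolver (the field semantics and "later non-empty wins" rule are assumptions for illustration, not the talk's actual resolver):

```python
# Sketch of the Improver contract: f(best_guess, {new facts}) ~> new best guess.
# The batch layer folds ALL facts from a blank; the speed layer folds just the
# new fact into the current best. Same code, same contract, asymptotes to truth.
def improve(best_guess, facts):
    """Merge partial records; later non-empty fields win."""
    merged = dict(best_guess)
    for fact in facts:
        merged.update({k: v for k, v in fact.items() if v is not None})
    return merged

BLANK = {}
facts = [{"name": "joeman", "homepage": None}, {"homepage": "http://example.org"}]

batch_view = improve(BLANK, facts)                  # batch: fold everything from blank
speed_view = improve(batch_view, [{"bio": "hi"}])   # speed: fold only the new fact
print(speed_view)
# {'name': 'joeman', 'homepage': 'http://example.org', 'bio': 'hi'}
```

Because both layers call the same function, a speed-layer guess is never structurally different from a batch-layer answer; the batch run just starts from a blank and sees more facts.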
52. Two Big Ideas
• Fine-grained control over architectural tradeoffs
• Truth lives at the edge, not the middle
Lets you trade off how quickly, how expensively, how true, how justified
New Paradigm for how, when and where we handle truth
53. Two Big Ideas
• Fine-grained control over architectural tradeoffs
• Approximate a pure function on all data
• What we do now that architecture is free
• Truth lives at the edge, not the middle
54. Two Big Ideas
• Fine-grained control over architectural tradeoffs
• Approximate a pure function on all data
• What we do now that architecture is free
• Truth lives at the edge, not the middle
• Data is syndicated forward from arrival to serving
• “Query at write time”
55. λ Arch: Truth, not Plumbing
• Lambda architecture isn’t about speed layer / batch layer.
• It’s about
• moving truth to the edge, not the center;
• enabling fine-grained tradeoffs against fundamental limits;
• decoupling consumer from infrastructure
• decoupling consumer from asynchrony
• …with profound implications for how you build your teams
This way of doing it simplifies architecture: local interactions only, and elimination of asynchrony. Which in turn profoundly simplifies development and operations, and allows you to structure your team the way you structure the architecture.
56. Lambda Architecture for a Dinky Little Blog
So far, we’ve talked about a bunch of reasons why you might be led **to** a lambda architecture.
And when there’s a new technology people always first ask why they should do it differently, which is a wise thing to ask and a foolish thing to insist on.
But let’s look at it from the other end, from what life is like if this were the natural state of being.
And to do so, let’s take the most unjustifiable case for a high scale architecture: a blog engine.
57. Blog: Traditional Approach
• Familiar with the ORM Rails-style blog:
• Models: User, Article, Comment
• Views:
• /user/:id (user info, links to their articles and comments);
• /articles (list of articles);
• /articles/:id (article content, comments, author info)
58. Example records:
• User: id 3, name joeman, homepage http://…, photo http://…, bio “…”
• Article: id 7, title “The Crisis”, body “These are…”, author_id 3, created_at 2014-08-08
• Comment: id 12, body “lol”, article_id 7, author_id 3
59. [Mockups of the two rendered pages: the article show page (title, body, author name/photo/bio sidebar, comments) and the user show page (name, photo, bio, and the user’s articles with snippets).]
61. Syndicate on Write
[Diagram: changes (Δ) to articles, users, and comments flow through Biographer and Reporter processes, which write denormalized View Fragments consumed by the show pages.]
DB models are sole source of truth
Denormalized
Used directly by reader and writer
View is constructed from spare parts at read time
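Syndicate-on-write can be sketched as a tiny pub/sub: reporters subscribe to model changes and rebuild denormalized view fragments as each write arrives, so reads become a plain key lookup. (All names here are illustrative, not the talk's actual system.)

```python
# Sketch of syndicate-on-write: every model change is pushed forward through
# reporters that keep the serving store's view fragments fresh.
fragments = {}     # serving store: precomputed view fragments, keyed by (kind, id)
subscribers = {}   # model name -> list of reporter callbacks

def subscribe(model, reporter):
    subscribers.setdefault(model, []).append(reporter)

def write(model, record):
    # Syndicate the change forward: every interested reporter sees the record.
    for reporter in subscribers.get(model, []):
        reporter(record)

def article_reporter(article):
    # Single-concern reporter: keeps the "compact article" fragment fresh.
    fragments[("compact_article", article["id"])] = {
        "title": article["title"], "snippet": article["body"][:20]}

subscribe("article", article_reporter)
write("article", {"id": 7, "title": "The Crisis", "body": "These are the times..."})
print(fragments[("compact_article", 7)])
```

The write path does the joining and denormalizing once; the read path just fetches the fragment, which is what makes every page render in O(1).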
62. Data Engineer: “What data model would you like to receive?”
Web Engineer: {“title”:”…”, “body”:”…”,…}
• (…hack hack hack…) /articles/v1/show.json ships
Web Engineer: “lol um can I also have {“title”:”…”, “body”:”…”, “snippet”:…}?”
• (…hack hack hack…) /articles/v2/show.json ships
63. Syndicated Data
• The Data is always _there_
• …but sometimes it’s more perfect than other times.
64. Syndicated Data
• Reports are cheap, single-concern, and faithful to the view.
• You start thinking like the customer, not the database
• All pages render in O(1):
• Your imagination doesn’t have to fit inside a TCP timeout
• Data is immutable, flows are idempotent:
• Interface change is safe
• Data is always _there_,
• Asynchrony doesn’t affect consumers
• Everything is decoupled:
• Way harder to break everything
One of the worst pains in asses is the query that takes 1500 milliseconds. It needs to be immediate, is usually mission-critical, and is expensive in all ways.
55. λ Arch: Truth, not Plumbing
• Lambda architecture isn’t about speed layer / batch layer.
• It’s about
• moving truth to the edge, not the center;
• enabling fine-grained tradeoffs against fundamental limits;
• decoupling consumer from infrastructure
• decoupling consumer from asynchrony
• …with profound implications for how you build your teams
68. Changes update models
[Diagram: Δ article, Δ user, and Δ comment events update the article, user, and comment models, plus a history log.]
Models stay the same: User, Article, Comment. Updated directly.
Reporters can subscribe to models.
On update, a reporter receives the updated object, and can do anything else it wants. Typically, it’s to create a new report.
Reports live in the target domain: faithful to the data consumer. In this case, they look very close to the information hierarchy of the rendered page.
All pages render in O(1). Your imagination is not constrained by the length of a TCP timeout.
70. Serve Report Fragments
[Diagram: the rendered article show page assembled from precomputed fragments: expanded article, compact article, user’s # articles, expanded user, sidebar user, user’s # comments, compact comment, micro user.]
71. [Mockup: the article show page as rendered.] Its precomputed report:
{
  "title":"Article Title",
  "body":"Article Body Lorem [...]",
  "author":{ ... },
  "comments": [
    {"comment_id":1, "body":"First Post",...},
    {"comment_id":2, "body":"lol",...},
    ...
  ]}
72. Serve Report Fragments
[Diagram: the rendered user show page assembled from the same precomputed fragments: expanded article, compact article, user’s # articles, expanded user, sidebar user, user’s # comments, compact comment, micro user.]
74. Two Big Ideas
• Fine-grained control over those architectural tradeoffs
• Truth lives at the edge, not the middle
82. Objections
• Three objections:
1. Why hasn’t it been done before?
2. Architecture Astronaut
3. I’m not at high scale
• Response:
1. Chef/Puppet/Docker/etc
2. Chef/Puppet/Docker/etc
3. Shut Up
84. Objections
• Two APIs? Really?
• Yes. Guilty. That’s dumb and must be fixed.
• Spark or Samza, if you’re willing to only drink one flavor of Kool-Aid
• EZbake.io, a CSC / 42six project to attack this
• …but we shouldn’t be living at the low level anyhow