Apache Incubator Samza: Stream Processing at LinkedIn

16. Real-time Feeds
• 10+ billion writes per day
• 172,000 messages per second (average)
• 55+ billion messages per day to real-time consumers
17. Stream Processing is Hard
• Partitioning
• State
• Re-processing
• Failure semantics
• Joins to services or databases
• Non-determinism
28. Tasks
Partition 0

class PageKeyViewsCounterTask implements StreamTask {
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    int newCount = pageKeyViews.get(pageKey).incrementAndGet();
    collector.send(countStream, pageKey, newCount);
  }
}
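To make the task above concrete: a Samza job is wired together through a properties file. The sketch below uses real Samza configuration keys, but the job name, package, and input stream name are hypothetical stand-ins.

```properties
# Hypothetical job config for the PageKeyViewsCounterTask shown above.
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=page-key-views
# Fully-qualified task class (package name is an assumption).
task.class=samza.examples.PageKeyViewsCounterTask
# Consume partition 0..N of this Kafka topic (topic name is an assumption).
task.inputs=kafka.PageViewEvent
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
```

Samza instantiates one task per input partition, so the `Partition 0` label on the slide corresponds to one instance of this class.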
39. Tasks
[Diagram: a PageKeyViewsCounterTask consumes messages 1–4 from partition 0 of the Page Views stream and emits counts to partitions 0 and 1 of an output count stream.]
47. Tasks
[Diagram: the same flow, with the task also writing its last-processed offset (2) to partition 1 of a checkpoint stream, so it can resume from that offset after a failure.]
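The checkpointing behavior animated in the frames above is driven by configuration rather than task code. A minimal sketch, assuming Kafka-backed checkpoints (these are real Samza keys; the 60-second commit interval is just an illustrative choice):

```properties
# Store task offsets in a Kafka-backed checkpoint stream.
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Write a checkpoint every 60 seconds (value chosen for illustration).
task.commit.ms=60000
```

On restart, the task reads its last checkpointed offset and resumes from there, which is why failures reprocess at most the messages since the previous commit.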
68. YARN
You: I want to run command X on two machines with 512M of memory.
YARN: Cool, where’s your code?
You: http://some-host/jobs/download/my.tgz
YARN: I’ve run your command on grid-node-2 and grid-node-7.
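In Samza, that conversation with YARN maps onto a handful of config keys. A sketch matching the numbers in the dialogue (the keys are real Samza/YARN configuration; the package URL comes from the slide itself):

```properties
# Where YARN downloads the job tarball from.
yarn.package.path=http://some-host/jobs/download/my.tgz
# "two machines with 512M of memory"
yarn.container.count=2
yarn.container.memory.mb=512
```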
108. Remote RPC is slow
• Stream: ~500k records/sec/container
• Remote DB: far fewer — per-message round trips can’t keep up
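A back-of-envelope sketch of why per-message remote lookups become the bottleneck. The 1 ms round-trip latency below is an assumed figure for illustration, not a number from the talk:

```java
// Sketch: upper bound on synchronous remote lookups for a single thread.
public class RpcThroughput {
    // Max lookups/sec for one thread blocking on each call,
    // given a fixed per-call round-trip latency in microseconds.
    static long maxSingleThreadLookupsPerSec(long latencyMicros) {
        return 1_000_000L / latencyMicros;
    }

    public static void main(String[] args) {
        // At an assumed ~1 ms per remote call, one thread tops out
        // around 1,000 lookups/sec -- orders of magnitude below the
        // ~500k records/sec the local stream can deliver.
        System.out.println(maxSingleThreadLookupsPerSec(1_000)); // 1000
    }
}
```

This is why Samza favors local, partitioned state over calling out to a remote database per message.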
110. No undo
• Database state is non-deterministic
• Can’t roll back mutations if a task crashes
127. Stateful Stream Task

public class SimpleStatefulTask implements StreamTask, InitableTask {
  private KeyValueStore<String, String> store;

  public void init(Config config, TaskContext context) {
    this.store = (KeyValueStore<String, String>) context.getStore("mystore");
  }

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String memberId = record.get("member_id").toString();
    String name = record.get("name").toString();
    System.out.println("old name: " + store.get(memberId));
    store.put(memberId, name);
  }
}
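The `"mystore"` store the task retrieves in `init` is declared in the job config. A sketch using Samza's key-value store keys (the changelog topic name is an assumption):

```properties
# Local key-value store backing the stateful task above.
stores.mystore.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
# Replicate every write to a Kafka changelog so state survives
# container failure (topic name chosen for illustration).
stores.mystore.changelog=kafka.mystore-changelog
stores.mystore.key.serde=string
stores.mystore.msg.serde=string
```

The changelog is what makes local state recoverable: on restart, Samza restores the store by replaying the compacted changelog topic.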
132. Let’s be Friends!
• We are incubating, and you can help!
• Get up and running in 5 minutes
http://bit.ly/hello-samza
• Grab some newbie JIRAs
http://bit.ly/samza_newbie_issues
Editor's Notes
- Stream processing for us = anything asynchronous, but not batch computed. 25% of code is async, 50% is RPC/online, 25% is batch. Stream processing is the worst supported of the three.
- Compute top shares, pull in, scrape, entity tag; language detection; send emails ("friend was in the news"). Requirement: has to be fast, since news is trendy.
- Relevance pipeline. We send relatively data-rich emails; some are time sensitive (need to be sent soon).
- Time sensitive; data ingestion pattern. Other systems that follow this pattern: real-time OLAP system and social graph system.
- Ecosystem at LinkedIn (some unique traits); hard unsolved problems in this space.
- Once we had all this data in Kafka, we wanted to do stuff with it. Persistent, reliable, distributed message queue. Kafka = first among equals, but stream systems are pluggable, just like Hadoop with HDFS vs. S3.
- Started with just a simple web service that consumes and produces Kafka messages; realized there are a lot of hard problems to solve. Reprocessing: what if my algorithm changes and I need to reprocess all events? Non-determinism: queries to external systems, time dependencies, ordering of messages.
- Open area of research; been around for 20 years.
- Partitioned, re-playable, ordered, fault tolerant, infinite — a very heavyweight definition of a stream (vs. S4, Storm, etc.).
- Partition assignment happens on write.
- At least once messaging: duplicates are possible. Future: exact semantics. Transparent to user; no ack'ing API.
- Connected by stream name only; fully buffered. Split JobTracker up: resource management, process isolation, fault tolerance, security.
- Group by, sum, count. Stream to stream, stream to table, table to table. Buffered sorting.
- Changelog/redo log; state machine model. Can also consume these streams from other jobs.
- Can't keep messages forever. Log compaction: delete over-written keys over time.
- Store API is pluggable: Lucene, buffered sort, external sort, bitmap index, bloom filters and sketches.