O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.
High 
Throughput 
Transactional 
Stream 
Processing 
Terence 
Yim 
(@chtyim)
Watch the video with slide 
synchronization on InfoQ.com! 
http://www.infoq.com/presentations 
/acid-stream-processing 
In...
Presented at QCon San Francisco 
www.qconsf.com 
Purpose of QCon 
- to empower software development by facilitating the sp...
PROPRIETARY & CONFIDENTIAL 
Who 
We 
Are 
• Create 
open 
source 
software 
than 
provides 
simple 
access 
to 
powerful 
...
PROPRIETARY & CONFIDENTIAL 
Tigon 
Architecture 
• Basic 
unit 
of 
execution 
is 
called 
Flow 
• A 
Directed 
Acyclic 
G...
PROPRIETARY & CONFIDENTIAL 
Execution 
Model 
• Distributed 
mode 
• Runs 
on 
YARN 
through 
Apache 
Twill 
• One 
YARN 
...
PROPRIETARY & CONFIDENTIAL 
Flowlet 
• Processing 
unit 
of 
a 
Flow 
• Flowlets 
within 
a 
Flow 
are 
connected 
through...
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordSplitter extends AbstractFlowlet { 
private OutputEmitter<String...
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordCounter extends AbstractFlowlet { 
! 
@ProcessInput 
public void...
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordCountFlow implements Flow { 
@Override 
public FlowSpecification...
PROPRIETARY & CONFIDENTIAL 
Data 
Consistency 
• Node 
dies 
• Process 
method 
throws 
Exception 
• Transient 
IO 
issues...
PROPRIETARY & CONFIDENTIAL 
Data 
Consistency 
• Resume 
from 
failure 
by 
replaying 
queue 
• At 
least 
once 
• Data 
l...
PROPRIETARY & CONFIDENTIAL 
Flowlet 
Transaction 
Queue … 
Transaction 
to 
the 
rescue 
Flowlet 
… 
Flowlet 
HBase 
Table...
PROPRIETARY & CONFIDENTIAL 
Tigon 
and 
HBase 
• Tigon 
uses 
HBase 
heavily 
• Queues 
are 
implemented 
on 
HBase 
Table...
PROPRIETARY & CONFIDENTIAL 
Tephra 
on 
HBase 
• Tephra 
(http://tephra.io) 
• Brings 
ACID 
to 
HBase 
• Extends 
to 
mul...
PROPRIETARY & CONFIDENTIAL 
Flowlet 
Transaction 
• Transaction 
starts 
before 
dequeue 
• Following 
actions 
happen 
in...
PROPRIETARY & CONFIDENTIAL 
Distributed 
Transactional 
Queue 
• Persisted 
transactionally 
on 
HBase 
• One 
row 
per 
q...
PROPRIETARY & CONFIDENTIAL 
Transaction 
Failure 
• Rollback 
cost 
may 
be 
high, 
depends 
on 
what 
triggers 
the 
fail...
PROPRIETARY & CONFIDENTIAL 
Performance 
Tips 
• Runs 
more 
Flowlet 
instances 
• Dequeue 
Strategy 
• @HashPartition 
• ...
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordSplitter extends AbstractFlowlet { 
private OutputEmitter<String...
PROPRIETARY & CONFIDENTIAL 
Word 
Count 
public class WordCounter extends AbstractFlowlet { 
! 
@ProcessInput 
@Batch(100)...
PROPRIETARY & CONFIDENTIAL 
Summary 
• Real-­‐time 
stream 
processing 
framework 
• Exactly 
once 
processing 
guarantees...
PROPRIETARY & CONFIDENTIAL 
Road 
map 
• Partitioned 
queue 
• Better 
scalability, 
better 
performance 
• Preliminary 
t...
PROPRIETARY & CONFIDENTIAL 
Contributions 
• Web-­‐site: 
http://tigon.io 
• Tigon 
in 
CDAP: 
http://cdap.io 
• Source: 
...
Watch the video with slide synchronization on 
InfoQ.com! 
http://www.infoq.com/presentations/acid-stream- 
processing
Próximos SlideShares
Carregando em…5
×

High Throughput Stream Processing with ACID Guarantees

740 visualizações

Publicada em

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yIzNrG.

Terence Yim from Continuuity showcases a transactional stream processing system that supports full ACID properties without compromising scalability and high throughput. Filmed at qconsf.com.

Terence Yim has been a Apache Twill committer since its incubation. He is a Software Engineer at Continuuity, responsible for designing and building realtime processing systems on Hadoop/HBase.

Publicada em: Tecnologia
  • Seja o primeiro a comentar

High Throughput Stream Processing with ACID Guarantees

  1. 1. High Throughput Transactional Stream Processing Terence Yim (@chtyim)
  2. 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /acid-stream-processing InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  3. 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. PROPRIETARY & CONFIDENTIAL Who We Are • Create open source software than provides simple access to powerful technologies • Cask Data Application Platform (http://cdap.io) • A platform runs on top of Hadoop to provide data and application virtualization • Virtualization of data through logical representations of underlying data • Virtualization of applications through application containers • Services and tools that enable faster application development and better operational control in production • Coopr (http://coopr.io) • Clusters with a Click • Self-­‐service, template-­‐based cluster provisioning system
  5. 5. PROPRIETARY & CONFIDENTIAL Tigon Architecture • Basic unit of execution is called Flow • A Directed Acyclic Graph (DAG) of Flowlets • Multiple modes of execution • Standalone • Useful for testing • Distributed • Fault tolerant, scalable Flowlet Flowlet Flowlet Flowlet TigonSQL Flowlet Events Tigon Architecture STANDALONE Threads In Memory Queues DISTRIBUTED YARN Containers HBase Tables
  6. 6. PROPRIETARY & CONFIDENTIAL Execution Model • Distributed mode • Runs on YARN through Apache Twill • One YARN container per one flowlet instance • One active thread per flowlet instance • Flowlet instances can scale dynamically and independently • No need to stop Flow • Standalone mode • Single JVM • One thread per flowlet instance • Queues are in-­‐memory, not really persisted
  7. 7. PROPRIETARY & CONFIDENTIAL Flowlet • Processing unit of a Flow • Flowlets within a Flow are connected through Distributed Queue • Consists of one or more Process Method(s) • User defined Java method • No restriction on what can be done in the method • A Process Method in a Flowlet can be triggered by • Dequeue objects emitted by upstream flowlet(s) • Repeatedly triggered by time delay • Useful for polling external data (Twitter Firehose, Kafka, …) • Inside Process Method, you can emit objects for downstream Flowlet(s)
  8. 8. PROPRIETARY & CONFIDENTIAL Word Count public class WordSplitter extends AbstractFlowlet { private OutputEmitter<String> output; ! @Tick(delay=100, unit=TimeUnit.MILLISECONDS) public void poll() { // Poll tweets from Twitter Firehose // ... for (String word : tweet.split("s+")) { output.emit(word); } } }
  9. 9. PROPRIETARY & CONFIDENTIAL Word Count public class WordCounter extends AbstractFlowlet { ! @ProcessInput public void process(String word) { // Increments count for the word in HBase } }
  10. 10. PROPRIETARY & CONFIDENTIAL Word Count public class WordCountFlow implements Flow { @Override public FlowSpecification configure() { return FlowSpecification.Builder.with() .setName("WordCountFlow") .setDescription("Flow for counting words) .withFlowlets() .add("splitter", new WordSplitter()) .add("counter", new WordCounter()) .connect() .from("splitter").to("counter") .build(); } }
  11. 11. PROPRIETARY & CONFIDENTIAL Data Consistency • Node dies • Process method throws Exception • Transient IO issues (e.g. connection timeout) • Conflicting updates • Writes to the same cell from two instances
  12. 12. PROPRIETARY & CONFIDENTIAL Data Consistency • Resume from failure by replaying queue • At least once • Data logic be idempotent • Program handles rollback / skipping • At most once • Lossy computation • Exactly once • Ideal model for data consistency as if failure doesn’t occurred ! • How about data already been written to backing store?
  13. 13. PROPRIETARY & CONFIDENTIAL Flowlet Transaction Queue … Transaction to the rescue Flowlet … Flowlet HBase Table Isolated Read Write conflicts Queue … Flowlet
  14. 14. PROPRIETARY & CONFIDENTIAL Tigon and HBase • Tigon uses HBase heavily • Queues are implemented on HBase Tables • Optionally integrated with HBase as user data stores ! • HBase has limited support on transaction • Has atomic cell operations • Has atomic batch operations on rows within the same region • NO cross region atomic operations • NO cross table atomic operations • NO multi-­‐RPC atomic operations
  15. 15. PROPRIETARY & CONFIDENTIAL Tephra on HBase • Tephra (http://tephra.io) • Brings ACID to HBase • Extends to multi-­‐rows, multi-­‐regions, multi-­‐tables • Multi-­‐Version Concurrency Control • Cell version = Transaction ID • All writes in the same transaction use the same transaction ID as version • Reads isolation by excluding uncommitted transactions • Optimistic Concurrency Control • Conflict detection at commit time • No locking, hence no deadlock • Performs good if conflicts happens rarely
  16. 16. PROPRIETARY & CONFIDENTIAL Flowlet Transaction • Transaction starts before dequeue • Following actions happen in the same transaction • Dequeue • Invoke Process Method • States updates • Only if updates are integrated with Tephra (e.g. Queue and TransactionAwareHTable in Tephra) • Enqueue • Transaction failure will trigger rollback • Exception thrown from Process Method • Write conflicts
  17. 17. PROPRIETARY & CONFIDENTIAL Distributed Transactional Queue • Persisted transactionally on HBase • One row per queue entry • Enqueue • Batch updates at commit time • Commits together with user updates • Dequeue • Scans for uncommitted entries • Marks entries as processed on commit • Coprocessor • Skipping committed entries on dequeue scan • Cleanup consumed entries on flush/compact
  18. 18. PROPRIETARY & CONFIDENTIAL Transaction Failure • Rollback cost may be high, depends on what triggers the failure • User exception • Most likely low as most changes are still in local buffer • Write conflicts • Relatively low if conflicts are detected before persisting • High if changes are persisted and conflicts found during the commit phase • Flowlet optionally implements the Callback interface to intercept transaction failure • Decide either retry or abort the failed transaction • Default is to retry with limited number of times (Optimistic Concurrency Control) • Max retries is setting through the @ProcessInput annotation.
  19. 19. PROPRIETARY & CONFIDENTIAL Performance Tips • Runs more Flowlet instances • Dequeue Strategy • @HashPartition • Hash on the write key to avoid write conflicts • Batch dequeue • Use @Batch annotation on Process Method • More entries will be processed in one transaction • Minimize IO and transaction overhead
  20. 20. PROPRIETARY & CONFIDENTIAL Word Count public class WordSplitter extends AbstractFlowlet { private OutputEmitter<String> output; ! @Tick(delay=100, unit=TimeUnit.MILLISECONDS) public void poll() { // Poll tweets from Twitter Firehose // ... for (String word : tweet.split("s+")) { output.emit(word, "key", word); // Hash by the word } } }
  21. 21. PROPRIETARY & CONFIDENTIAL Word Count public class WordCounter extends AbstractFlowlet { ! @ProcessInput @Batch(100) @HashPartition("key") public void process(String word) { // ... } }
  22. 22. PROPRIETARY & CONFIDENTIAL Summary • Real-­‐time stream processing framework • Exactly once processing guarantees • Transaction message queue on Apache HBase • Transactional storage integration • Through Tephra transaction engine • Executes on Hadoop YARN • Through Apache Twill • Simple Java Programmatic API • Imperative programming • Data model through Java Object
  23. 23. PROPRIETARY & CONFIDENTIAL Road map • Partitioned queue • Better scalability, better performance • Preliminary tests shows 100K events/sec on 8 nodes cluster with 10 flowlet instances • Linearly scalable • Drain / cleanup queue • Better controls for upgrade • Supports more programming languages • External logging and metrics system integration • More source Flowlet types • Kafka, Twitter, Flume…
  24. 24. PROPRIETARY & CONFIDENTIAL Contributions • Web-­‐site: http://tigon.io • Tigon in CDAP: http://cdap.io • Source: https://www.github.com/caskdata/tigon • Mailing lists • tigon-­‐dev@googlegroups.com • tigon-­‐user@googlegroups.com • JIRA • http://issues.cask.co/browse/TIGON
  25. 25. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/acid-stream- processing

×