Stream processing with R and Amazon Kinesis
Big Data Day LA 2016
Gergely Daroczi
@daroczig
July 9 2016
About me
Gergely Daroczi (@daroczig) Stream processing with R and Kinesis Big Data Day LA 2016 2 / 28
CARD.com’s View of the World
CARD.com’s View of the World
Modern Marketing at CARD.com
Further Data Partners
card transaction processors
card manufacturers
CIP/KYC service providers
online ad platforms
remarketing networks
licensing partners
communication engines
others
Infrastructure
Why R?
Infrastructure
The Classic Era
Source: SnowPlow
The Hybrid Era
Source: SnowPlow
The Unified Era
Source: SnowPlow
Why Amazon Kinesis & Redshift?
Source: Kinesis Product Details
Intro to Amazon Kinesis Streams
Source: Kinesis Developer Guide
Intro to Amazon Kinesis Shards
Source: AWS re:Invent 2013
How to Communicate with Kinesis
Writing data to the stream:
Amazon Kinesis Streams API (!)
Amazon Kinesis Producer Library (KPL) from Java
flume-kinesis
Amazon Kinesis Agent
Reading data from the stream:
Amazon Kinesis Streams API (!)
Amazon Kinesis Client Library (KCL) from Java, Node.js, .NET,
Python, Ruby
Managing streams:
Amazon Kinesis Streams API (!)
Now We Need an R Client!
Initialize connection to the SDK:
> library(rJava)
> .jinit(classpath =
+ list.files(
+ '~/Projects/kineric/inst/java/',
+ full.names = TRUE),
+ parameters = "-Xmx512m")
What methods do we have there?
> kc <- .jnew('com.amazonaws.services.kinesis.AmazonKinesisClient')
> .jmethods(kc)
[1] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.splitShard(java
[2] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.splitShard(com.
[3] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.removeTagsFromS
[4] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
[5] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
[6] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
[7] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se
[8] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
[9] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
[10] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona
The First Steps from R
Set API endpoint:
> kc$setEndpoint('kinesis.us-west-2.amazonaws.com', 'kinesis', 'us-west-2')
What streams do we have access to?
> kc$listStreams()
[1] "Java-Object{{StreamNames: [test_kinesis],HasMoreStreams: false}}"
> kc$listStreams()$getStreamNames()
[1] "Java-Object{[test_kinesis]}"
Describe this stream:
> (kcstream <- kc$describeStream(StreamName = 'test_kinesis')$getStreamDescription()
[1] "Java-Object{{StreamName: test_kinesis,StreamARN: arn:aws:kinesis:us-west-2:5957
> kcstream$getStreamName()
[1] "test_kinesis"
> kcstream$getStreamStatus()
[1] "ACTIVE"
> kcstream$getShards()
[1] "Java-Object{[
{ShardId: shardId-000000000000,HashKeyRange: {
StartingHashKey: 0,EndingHashKey: 17014118346046923173168730371588410572324},Sequenc
{ShardId: shardId-000000000001,HashKeyRange: {StartingHashKey: 170141183460469231731
Writing to the Stream
> prr <- .jnew('com.amazonaws.services.kinesis.model.PutRecordRequest')
> prr$setStreamName('test_kinesis')
> prr$setData(J('java.nio.ByteBuffer')$wrap(.jbyte(charToRaw('foobar'))))
> prr$setPartitionKey('42')
> prr
[1] "Java-Object{{StreamName: test_kinesis,
Data: java.nio.HeapByteBuffer[pos=0 lim=6 cap=6],
PartitionKey: 42,}}"
> kc$putRecord(prr)
[1] "Java-Object{{ShardId: shardId-000000000001,
SequenceNumber: 49562894160449444332153346371084313572324361665031176210}}"
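The request-building steps above can be wrapped into a small helper; a sketch (the kinesis_put name is ours, and it assumes the kc client initialized on the earlier slides):

```r
## sketch: wrap the PutRecordRequest boilerplate (assumes `kc` from above)
kinesis_put <- function(stream, data, partition_key = as.character(sample(1e6, 1))) {
    prr <- .jnew('com.amazonaws.services.kinesis.model.PutRecordRequest')
    prr$setStreamName(stream)
    prr$setData(J('java.nio.ByteBuffer')$wrap(.jbyte(charToRaw(data))))
    prr$setPartitionKey(partition_key)
    ## returns a PutRecordResult with the shard id and sequence number
    kc$putRecord(prr)
}
```

For example, kinesis_put('test_kinesis', 'foobar', '42') reproduces the call shown above.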
Get Records from the Stream
First, we need a shard iterator, which expires in 5 minutes:
> sir <- .jnew('com.amazonaws.services.kinesis.model.GetShardIteratorRequest')
> sir$setStreamName('test_kinesis')
> sir$setShardId(.jnew('java/lang/String', '0'))
> sir$setShardIteratorType('TRIM_HORIZON')
> iterator <- kc$getShardIterator(sir)$getShardIterator()
Then we can use it to get records:
> gir <- .jnew('com.amazonaws.services.kinesis.model.GetRecordsRequest')
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[]}"
The data was pushed to the other shard:
> sir$setShardId(.jnew('java/lang/String', '1'))
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
ApproximateArrivalTimestamp: Tue Jun 14 09:40:19 CEST 2016,
Data: java.nio.HeapByteBuffer[pos=0 lim=6 cap=6],
PartitionKey: 42}]}"
Get Further Records from the Stream
> sapply(kc$getRecords(gir)$getRecords(),
+ function(x)
+ rawToChar(x$getData()$array()))
[1] "foobar"
> iterator <- kc$getRecords(gir)$getNextShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[]}"
Get the oldest available data again:
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
Or we can refer to the last known sequence number as well:
> sir$setShardIteratorType('AT_SEQUENCE_NUMBER')
> sir$setStartingSequenceNumber('495628941604494443321533463710843135723243616650311
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503
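The iterator dance above can also be bundled into a helper; a sketch (the function name is ours, and it again assumes the kc client from the earlier slides):

```r
## sketch: read the oldest available records from one shard as strings
kinesis_read_shard <- function(stream, shard = '0') {
    sir <- .jnew('com.amazonaws.services.kinesis.model.GetShardIteratorRequest')
    sir$setStreamName(stream)
    sir$setShardId(.jnew('java/lang/String', shard))
    sir$setShardIteratorType('TRIM_HORIZON')
    gir <- .jnew('com.amazonaws.services.kinesis.model.GetRecordsRequest')
    gir$setShardIterator(kc$getShardIterator(sir)$getShardIterator())
    ## one GetRecords call; a real consumer would loop on getNextShardIterator
    sapply(kc$getRecords(gir)$getRecords(),
           function(x) rawToChar(x$getData()$array()))
}
```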
Managing Shards
No need for both shards:
> ms <- .jnew('com.amazonaws.services.kinesis.model.MergeShardsRequest')
> ms$setShardToMerge('shardId-000000000000')
> ms$setAdjacentShardToMerge('shardId-000000000001')
> ms$setStreamName('test_kinesis')
> kc$mergeShards(ms)
What do we have now?
> kc$describeStream(StreamName = 'test_kinesis')$getStreamDescription()$getShards()
[1] "Java-Object{[
{ShardId: shardId-000000000000,HashKeyRange: {StartingHashKey: 0,EndingHashKey: 1701
SequenceNumberRange: {
StartingSequenceNumber: 49562894160427143586954815717376297430913467927668719618,
EndingSequenceNumber: 49562894160438293959554081028945856364232263390243848194}},
{ShardId: shardId-000000000001,HashKeyRange: {StartingHashKey: 170141183460469231731
SequenceNumberRange: {
StartingSequenceNumber: 49562894160449444332153346340517833149186116289174700050,
EndingSequenceNumber: 49562894160460594704752611652087392082504911751749828626}},
{ShardId: shardId-000000000002,
ParentShardId: shardId-000000000000,
AdjacentParentShardId: shardId-000000000001,
HashKeyRange: {StartingHashKey: 0,EndingHashKey: 34028236692093846346337460743176821
SequenceNumberRange: {StartingSequenceNumber: 49562904991497673099704924344727019527
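Splitting works the same way in the other direction, via SplitShardRequest; a sketch (the hash key below is the midpoint of the full key space, 2^127, purely for illustration):

```r
## sketch: split the merged shard back into two
ss <- .jnew('com.amazonaws.services.kinesis.model.SplitShardRequest')
ss$setStreamName('test_kinesis')
ss$setShardToSplit('shardId-000000000002')
ss$setNewStartingHashKey('170141183460469231731687303715884105728')
kc$splitShard(ss)
```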
Next Steps
Ideas:
R function managing/scaling shards and attaching R processors
One R processor per shard (parallel threads on a node or cluster)
Store started/finished sequence number in memory/DB (checkpointing)
Example (Design for a simple MVP on a single node)
mcparallel starts parallel R processes to evaluate the given expression
Problems:
duplicated records after failure due to in-memory checkpoint
needs a DB for running on a cluster
hard to write good unit tests
Why not use KCL then?
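The single-node MVP sketched above might look roughly like this with mcparallel (a hedged sketch; process_shard is a hypothetical per-shard consumer, not part of any package):

```r
library(parallel)

## hypothetical per-shard consumer: poll the shard, process records,
## keep the last sequence number as an in-memory checkpoint
process_shard <- function(shard_id) {
    repeat {
        ## ... get shard iterator, getRecords, apply business logic ...
        Sys.sleep(1)
    }
}

## one forked R process per shard (single node only; no fault tolerance)
jobs <- lapply(c('shardId-000000000000', 'shardId-000000000001'),
               function(s) mcparallel(process_shard(s)))
## block until the workers exit (or monitor and restart them here)
mccollect(jobs)
```

This illustrates exactly the problems listed above: the checkpoint lives in worker memory, so a crash replays records, and nothing coordinates across nodes.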
Amazon Kinesis Client Library
An easy-to-use programming model for processing data
java -cp amazon-kinesis-client-1.6.1.jar \
    com.amazonaws.services.kinesis.multilang.MultiLangDaemon \
    test-kinesis.properties
Scale-out and fault-tolerant processing (checkpointing via DynamoDB)
Logging and metrics in CloudWatch
The MultiLangDaemon spawns consumer processes written in any language;
communication happens via JSON messages sent over stdin/stdout
Only a few events/methods to care about in the consumer application:
1 initialize
2 processRecords
3 checkpoint
4 shutdown
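The properties file passed to the daemon might look roughly like this (key names follow the KCL multilang sample configuration; the executable path and application name are illustrative):

```properties
# which consumer to spawn per shard
executableName = ./consumer.R
streamName = test_kinesis
# also names the DynamoDB checkpoint table
applicationName = kinesis-r-demo
AWSCredentialsProvider = DefaultAWSCredentialsProviderChain
regionName = us-west-2
initialPositionInStream = TRIM_HORIZON
```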
Messages from the KCL
1 initialize:
Perform initialization steps
Write a “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
2 processRecords:
Perform processing tasks (you may write a checkpoint message at any time)
Write a “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
3 shutdown:
Perform shutdown tasks (you may write a checkpoint message at any time)
Write a “status” message to STDOUT to indicate you are done
Begin reading a line from STDIN to receive the next action
4 checkpoint:
Decide whether to checkpoint again based on whether there was an error or not
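On the wire, the exchange looks roughly like this (message shapes follow the multilang protocol; the sequence numbers are elided and the payload is the Base64 encoding of "foobar"):

```text
daemon   -> consumer: {"action": "initialize", "shardId": "shardId-000000000001"}
consumer -> daemon:   {"action": "status", "responseFor": "initialize"}
daemon   -> consumer: {"action": "processRecords", "records":
                        [{"data": "Zm9vYmFy", "partitionKey": "42",
                          "sequenceNumber": "49562..."}]}
consumer -> daemon:   {"action": "checkpoint", "checkpoint": "49562..."}
consumer -> daemon:   {"action": "status", "responseFor": "processRecords"}
```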
Again: Why R?
R Script Interacting with KCL
#!/usr/bin/r -i

## for fromJSON/toJSON
library(jsonlite)

while (TRUE) {
    ## read and parse the next message from the MultiLangDaemon
    line <- fromJSON(readLines(n = 1))
    ## nothing to do unless we receive records to process
    if (line$action == 'processRecords') {
        ## process each record (note: the data field is Base64-encoded)
        lapply(line$records, function(r) {
            business_logic(r)
            cat(toJSON(list(action = 'checkpoint',
                            checkpoint = r$sequenceNumber),
                       auto_unbox = TRUE), '\n')
        })
    }
    ## return a newline-delimited status response in JSON
    cat(toJSON(list(action = 'status', responseFor = line$action),
               auto_unbox = TRUE), '\n')
}
Examples
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Big Data Day LA 2016/ Data Science Track - Stream processing with R and Amazon Kinesis, Gergely Daroczi, Director of Analytics, CARD.com

• 16. How to Communicate with Kinesis

Writing data to the stream:
    Amazon Kinesis Streams API (!)
    Amazon Kinesis Producer Library (KPL) from Java
    flume-kinesis
    Amazon Kinesis Agent

Reading data from the stream:
    Amazon Kinesis Streams API (!)
    Amazon Kinesis Client Library (KCL) from Java, Node.js, .NET, Python, Ruby

Managing streams:
    Amazon Kinesis Streams API (!)
• 17. Now We Need an R Client!

Initialize connection to the SDK:

> library(rJava)
> .jinit(classpath =
+            list.files(
+                '~/Projects/kineric/inst/java/',
+                full.names = TRUE),
+        parameters = "-Xmx512m")

What methods do we have there?

> kc <- .jnew('com.amazonaws.services.kinesis.AmazonKinesisClient')
> .jmethods(kc)
 [1] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.splitShard(java...
 [2] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.splitShard(com....
 [3] "public void com.amazonaws.services.kinesis.AmazonKinesisClient.removeTagsFromS...
 [4] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se...
 [5] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se...
 [6] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se...
 [7] "public com.amazonaws.services.kinesis.model.ListStreamsResult com.amazonaws.se...
 [8] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona...
 [9] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona...
[10] "public com.amazonaws.services.kinesis.model.GetShardIteratorResult com.amazona...
• 18. The First Steps from R

Set API endpoint:

> kc$setEndpoint('kinesis.us-west-2.amazonaws.com', 'kinesis', 'us-west-2')

What streams do we have access to?

> kc$listStreams()
[1] "Java-Object{{StreamNames: [test_kinesis],HasMoreStreams: false}}"
> kc$listStreams()$getStreamNames()
[1] "Java-Object{[test_kinesis]}"

Describe this stream:

> (kcstream <- kc$describeStream(StreamName = 'test_kinesis')$getStreamDescription())
[1] "Java-Object{{StreamName: test_kinesis,StreamARN: arn:aws:kinesis:us-west-2:5957...
> kcstream$getStreamName()
[1] "test_kinesis"
> kcstream$getStreamStatus()
[1] "ACTIVE"
> kcstream$getShards()
[1] "Java-Object{[
{ShardId: shardId-000000000000,HashKeyRange: {StartingHashKey: 0,
 EndingHashKey: 17014118346046923173168730371588410572324},Sequenc...
{ShardId: shardId-000000000001,HashKeyRange: {StartingHashKey: 170141183460469231731...
• 19. Writing to the Stream

> prr <- .jnew('com.amazonaws.services.kinesis.model.PutRecordRequest')
> prr$setStreamName('test_kinesis')
> prr$setData(J('java.nio.ByteBuffer')$wrap(.jbyte(charToRaw('foobar'))))
> prr$setPartitionKey('42')
> prr
[1] "Java-Object{{StreamName: test_kinesis,
 Data: java.nio.HeapByteBuffer[pos=0 lim=6 cap=6],
 PartitionKey: 42,}}"
> kc$putRecord(prr)
[1] "Java-Object{{ShardId: shardId-000000000001,
 SequenceNumber: 49562894160449444332153346371084313572324361665031176210}}"
• 20. Get Records from the Stream

First, we need a shard iterator, which expires in 5 minutes:

> sir <- .jnew('com.amazonaws.services.kinesis.model.GetShardIteratorRequest')
> sir$setStreamName('test_kinesis')
> sir$setShardId(.jnew('java/lang/String', '0'))
> sir$setShardIteratorType('TRIM_HORIZON')
> iterator <- kc$getShardIterator(sir)$getShardIterator()

Then we can use it to get records:

> gir <- .jnew('com.amazonaws.services.kinesis.model.GetRecordsRequest')
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[]}"

The data was pushed to the other shard:

> sir$setShardId(.jnew('java/lang/String', '1'))
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503...
 ApproximateArrivalTimestamp: Tue Jun 14 09:40:19 CEST 2016,
 Data: java.nio.HeapByteBuffer[pos=0 lim=6 cap=6],
 PartitionKey: 42}]}"
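The record showing up on shard 1 rather than shard 0 is no accident: Kinesis routes each record by taking the MD5 hash of its partition key and matching the resulting 128-bit integer against the shards' hash key ranges. A minimal Python sketch of that routing rule (not from the talk; the shard layout below is a hypothetical even split mirroring the two-shard test_kinesis stream):

```python
import hashlib

def partition_key_hash(key):
    # Kinesis maps the partition key to a 128-bit integer via MD5
    return int.from_bytes(hashlib.md5(key.encode('utf-8')).digest(), 'big')

def route(key, shards):
    # Return the shard whose [start, end] hash key range covers the key's hash
    h = partition_key_hash(key)
    for shard_id, start, end in shards:
        if start <= h <= end:
            return shard_id
    raise ValueError('no shard covers this hash key')

MAX_HASH = 2 ** 128 - 1
shards = [
    ('shardId-000000000000', 0, MAX_HASH // 2),
    ('shardId-000000000001', MAX_HASH // 2 + 1, MAX_HASH),
]

# partition key '42' hashes into the upper half of the key space,
# so it lands on the second shard -- just like in the slides
print(route('42', shards))
```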
• 22. Get Further Records from the Stream

> sapply(kc$getRecords(gir)$getRecords(),
+        function(x)
+            rawToChar(x$getData()$array()))
[1] "foobar"
> iterator <- kc$getRecords(gir)$getNextShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[]}"

Get the oldest available data again:

> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503...

Or we can refer to the last known sequence number as well:

> sir$setShardIteratorType('AT_SEQUENCE_NUMBER')
> sir$setStartingSequenceNumber('495628941604494443321533463710843135723243616650311...
> iterator <- kc$getShardIterator(sir)$getShardIterator()
> gir$setShardIterator(iterator)
> kc$getRecords(gir)$getRecords()
[1] "Java-Object{[{SequenceNumber: 4956289416044944433215334637108431357232436166503...
• 23. Managing Shards

No need for both shards:

> ms <- .jnew('com.amazonaws.services.kinesis.model.MergeShardsRequest')
> ms$setShardToMerge('shardId-000000000000')
> ms$setAdjacentShardToMerge('shardId-000000000001')
> ms$setStreamName('test_kinesis')
> kc$mergeShards(ms)

What do we have now?

> kc$describeStream(StreamName = 'test_kinesis')$getStreamDescription()$getShards()
[1] "Java-Object{[
{ShardId: shardId-000000000000,
 HashKeyRange: {StartingHashKey: 0,EndingHashKey: 1701...
 SequenceNumberRange: {
  StartingSequenceNumber: 49562894160427143586954815717376297430913467927668719618,
  EndingSequenceNumber: 49562894160438293959554081028945856364232263390243848194}},
{ShardId: shardId-000000000001,
 HashKeyRange: {StartingHashKey: 170141183460469231731...
 SequenceNumberRange: {
  StartingSequenceNumber: 49562894160449444332153346340517833149186116289174700050,
  EndingSequenceNumber: 49562894160460594704752611652087392082504911751749828626}},
{ShardId: shardId-000000000002,
 ParentShardId: shardId-000000000000,
 AdjacentParentShardId: shardId-000000000001,
 HashKeyRange: {StartingHashKey: 0,EndingHashKey: 34028236692093846346337460743176821...
 SequenceNumberRange: {StartingSequenceNumber: 49562904991497673099704924344727019527...
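Merging is only allowed for adjacent shards, i.e. shards whose hash key ranges form one contiguous interval; the child shard then covers the union of the two parent ranges, which is why the new shard above spans the full key space while its parents are closed. A small sketch of that invariant (plain Python, not from the talk):

```python
def adjacent(a, b):
    # Two hash key ranges (start, end) are adjacent if one begins
    # right after the other ends
    return a[1] + 1 == b[0] or b[1] + 1 == a[0]

def merged_range(a, b):
    # The child shard of a merge covers the union of the parent ranges
    if not adjacent(a, b):
        raise ValueError('can only merge adjacent shards')
    return (min(a[0], b[0]), max(a[1], b[1]))

MAX_HASH = 2 ** 128 - 1
shard0 = (0, MAX_HASH // 2)
shard1 = (MAX_HASH // 2 + 1, MAX_HASH)

# Mirrors the MergeShardsRequest above: the child spans the full key space
print(merged_range(shard0, shard1))
```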
• 24. Next Steps

Ideas:
    R function managing/scaling shards and attaching R processors
    One R processor per shard (parallel threads on a node or cluster)
    Store started/finished sequence number in memory/DB (checkpointing)

Example (design for a simple MVP on a single node):
    mcparallel starts parallel R processes to evaluate the given expression

Problems:
    duplicated records after failure due to in-memory checkpoint
    needs a DB for running on a cluster
    hard to write good unit tests

Why not use KCL then?
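The "duplicated records after failure" problem of the in-memory checkpoint can be demonstrated without Kinesis at all. Below is a hypothetical Python stand-in for the mcparallel-based R design: a worker whose checkpoint lives only in process memory loses its position on crash, so a restart falls back to the oldest record and reprocesses data:

```python
def process_shard(records, start_after=None, crash_at=None):
    """Process (sequence, data) pairs after the checkpointed sequence
    number; return (processed data, last in-memory checkpoint)."""
    processed, checkpoint = [], start_after
    for seq, data in records:
        if start_after is not None and seq <= start_after:
            continue  # already processed before the restart
        if crash_at is not None and seq == crash_at:
            # simulated crash: the in-memory checkpoint dies with the process
            return processed, None
        processed.append(data)
        checkpoint = seq  # in memory only -- not durable!
    return processed, checkpoint

records = [(1, 'a'), (2, 'b'), (3, 'c')]

# First run crashes before record 3; the checkpoint (2) is lost with it
first, cp = process_shard(records, crash_at=3)
# The restart has no checkpoint left, so 'a' and 'b' are processed twice
second, _ = process_shard(records, start_after=cp)
print(first + second)
```

Checkpointing to a database instead of memory fixes exactly this, which is what the KCL does via DynamoDB.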
• 27. Amazon Kinesis Client Library

An easy-to-use programming model for processing data:

    java -cp amazon-kinesis-client-1.6.1.jar \
        com.amazonaws.services.kinesis.multilang.MultiLangDaemon \
        test-kinesis.properties

Scale-out and fault-tolerant processing (checkpointing via DynamoDB)
Logging and metrics in CloudWatch
The MultiLangDaemon spawns processes written in any language; communication
happens via JSON messages sent over stdin/stdout.

Only a few events/methods to care about in the consumer application:
    1. initialize
    2. processRecords
    3. checkpoint
    4. shutdown
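The test-kinesis.properties file passed to the MultiLangDaemon tells it which executable to spawn and which stream to read. A hedged example based on the KCL multilang sample properties (key names from the KCL documentation; all values here are placeholders, not the talk's actual configuration):

```properties
# Consumer executable the daemon spawns for each shard lease
executableName = ./consumer.R
# Kinesis stream to process
streamName = test_kinesis
# Application name, also used for the DynamoDB checkpoint table
applicationName = test-kinesis-app
# Where to look for AWS credentials
AWSCredentialsProvider = DefaultAWSCredentialsProviderChain
# Start from the oldest available record on first run
initialPositionInStream = TRIM_HORIZON
regionName = us-west-2
```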
• 28. Messages from the KCL

1. initialize:
    Perform initialization steps
    Write "status" message to STDOUT to indicate you are done
    Begin reading a line from STDIN to receive the next action
2. processRecords:
    Perform processing tasks (you may write a checkpoint message at any time)
    Write "status" message to STDOUT to indicate you are done
    Begin reading a line from STDIN to receive the next action
3. shutdown:
    Perform shutdown tasks (you may write a checkpoint message at any time)
    Write "status" message to STDOUT to indicate you are done
    Begin reading a line from STDIN to receive the next action
4. checkpoint:
    Decide whether to checkpoint again based on whether there is an error or not
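The message flow above can be sketched as a tiny protocol handler that is testable without the daemon. This is a hypothetical Python stand-in for the R consumer shown on the following slide; the message shapes follow the slides, not the full MultiLang protocol specification:

```python
import json

def handle(line):
    """Take one JSON action message from the KCL daemon and return the
    JSON messages the consumer should write back on STDOUT."""
    msg = json.loads(line)
    out = []
    if msg['action'] == 'processRecords':
        for record in msg.get('records', []):
            # ... business logic on record['data'] would go here ...
            # checkpoint after each record, as in the R script
            out.append(json.dumps({'action': 'checkpoint',
                                   'checkpoint': record['sequenceNumber']}))
    # every action is acknowledged with a status message
    out.append(json.dumps({'action': 'status', 'responseFor': msg['action']}))
    return out

print(handle('{"action": "initialize", "shardId": "shardId-000000000000"}'))
# "Zm9vYmFy" is base64 for "foobar", as record data arrives base64-encoded
print(handle('{"action": "processRecords", '
             '"records": [{"data": "Zm9vYmFy", "sequenceNumber": "49562894"}]}'))
```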
• 29. Again: Why R?
• 30. R Script Interacting with KCL

#!/usr/bin/r -i

## fromJSON/toJSON, e.g. from the jsonlite package
library(jsonlite)

while (TRUE) {
    ## read and parse messages
    line <- fromJSON(readLines(n = 1))
    ## nothing to do unless we receive a record to process
    if (line$action == 'processRecords') {
        ## process each record
        lapply(line$records, function(r) {
            business_logic(r)
            cat(toJSON(list(action = 'checkpoint',
                            checkpoint = r$sequenceNumber)))
        })
    }
    ## return response in JSON
    cat(toJSON(list(action = 'status', responseFor = line$action)))
}
• 31. Examples