NASA Landsat data can be stored, transformed, navigated, and visualized. In this session we will explore how the Landsat dataset is stored in Amazon Simple Storage Service (S3), one of the recommended AWS cloud storage services for storing petabytes of data, and how data stored in S3 can be processed server-side with the Lambda service, visualized for users, and made available to search engines.
Created by: Ben Snively, Senior Solutions Architect
2. Agenda Overview
8:30 AM Welcome
8:45 AM Big Data in the Cloud
10:00 AM Break
10:15 AM Data Collection and Storage
11:30 AM Break
11:45 AM Real-time Event Processing
1:00 PM Lunch
1:30 PM HPC in the Cloud
2:45 PM Break
3:00 PM Processing and Analytics
4:15 PM Close
4. Real-Time Event Processing
• Event-driven programming
• Trigger actions based on real-time input
Examples:
– Proactively detect errors in logs and devices
– Identify abnormal activity
– Monitor performance SLAs
– Notify when SLAs/performance drop below a threshold
5. Two main processing patterns
• Stream processing (real time)
– Real-time response to events in data streams
– Relatively simple data computations (aggregates, filters, sliding windows)
• Micro-batching (near real time)
– Near real-time operations on small batches of events in data streams
– Standard processing and query engines for analysis
6. You're likely already "streaming"
• Sensor network analytics
• Network analytics
• Log shipping and centralization
• Clickstream analysis
• Gaming status
• Hardware and software appliance metrics
• …more…
• Any proxy layer B organizing and passing data from A to C
– A to B to C
8. Kinesis Stream & Shards
• Streams are made of Shards
• Each Shard ingests data at up to 1 MB/s and up to 1,000 TPS
• Each Shard emits up to 2 MB/s
• All data is stored for 24 hours
• Scale Kinesis streams by splitting or merging Shards
• Replay data inside the 24-hour window
• Retention is extensible up to 7 days
9. Amazon Kinesis: Why Stream Storage?
• Decouple producers & consumers
• Temporary buffer
• Preserve client ordering
• Streaming MapReduce
[Diagram: Producers 1 through N write records with partition keys (e.g. Key = Red, Key = Green) into Shard/Partition 1 and Shard/Partition 2; Consumer 1 and Consumer 2 each read the stream and maintain running counts: Count of Red = 4, Count of Violet = 4, Count of Blue = 4, Count of Green = 4.]
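The streaming MapReduce idea above boils down to a consumer tallying records by partition key as they arrive. A minimal sketch in Python (the record format here is illustrative; a real consumer would receive records from a shard via the KCL or the GetRecords API):

```python
from collections import Counter

def count_by_key(records):
    """Streaming 'map' step: tally records by their partition key.

    Each record is a (partition_key, data) tuple, mimicking how a
    consumer sees records after reading them from a shard.
    """
    counts = Counter()
    for key, _data in records:
        counts[key] += 1
    return counts

# One consumer's view of a shard: four Red and four Green records,
# matching the counts in the diagram above.
shard_records = [("Red", b"...")] * 4 + [("Green", b"...")] * 4
counts = count_by_key(shard_records)
```

Because Kinesis preserves ordering per partition key within a shard, each consumer's counts stay consistent without cross-shard coordination.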
10. How to Size your Kinesis Stream – Ingress
Suppose 2 producers, each producing 2 KB records at 500 records/s:
Minimum requirement: ingress capacity of 2 MB/s, egress capacity of 2 MB/s
A theoretical minimum of 2 shards is required, which provides an ingress capacity of 2 MB/s and an egress capacity of 4 MB/s.
[Diagram: two producers each write 2 KB × 500 TPS = 1,000 KB/s into one of two shards (1 MB/s apiece); a Payment Processing Application consumes the stream. Theoretical minimum of 2 shards required.]
11. How to Size your Kinesis Stream – Egress
Records are durably stored in Kinesis for 24 hours, allowing multiple consuming applications to process the data.
Let's extend the same example to three consuming applications:
If all applications read at the ingress rate of 1 MB/s per shard, an aggregate read capacity of 6 MB/s is required, exceeding the two shards' egress limit of 4 MB/s.
Solution: simple! Add another shard to the stream to spread the load.
[Diagram: the same two producers (2 KB × 500 TPS = 1,000 KB/s each) feed two shards; Payment Processing, Fraud Detection, and Recommendation Engine applications all consume the stream, creating an egress bottleneck.]
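The sizing arithmetic on these two slides can be captured in a small helper, using the per-shard limits quoted earlier (1 MB/s ingress, 2 MB/s egress):

```python
import math

INGRESS_PER_SHARD_MB = 1.0   # 1 MB/s write capacity per shard
EGRESS_PER_SHARD_MB = 2.0    # 2 MB/s read capacity per shard

def shards_needed(ingress_mb_per_s, num_consumers):
    """Return the minimum shard count that satisfies both limits.

    Every consuming application reads the full stream, so the
    required egress is ingress * num_consumers.
    """
    for_ingress = math.ceil(ingress_mb_per_s / INGRESS_PER_SHARD_MB)
    for_egress = math.ceil(
        ingress_mb_per_s * num_consumers / EGRESS_PER_SHARD_MB)
    return max(for_ingress, for_egress)

# Slide 10: 2 MB/s ingress, 1 consumer  -> 2 shards.
# Slide 11: 2 MB/s ingress, 3 consumers -> 6 MB/s of reads -> 3 shards.
```

With one consumer the ingress limit dominates; with three consumers the egress limit does, which is exactly why the third shard is added above.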
12. Putting Data into Kinesis
Simple PUT interface to store data in Kinesis:
• Producers use the PutRecord or PutRecords call to store data in a stream
• PutRecord {Data, StreamName, PartitionKey}
• A partition key is supplied by the producer and used to distribute the PUTs across shards
• Kinesis MD5-hashes the supplied partition key to map each record into a shard's hash key range
• A unique sequence number is returned to the producer upon a successful call
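The MD5 mapping can be reproduced in a few lines. This sketch assumes the stream's shards evenly split the 128-bit hash key space (true for a freshly created stream; splits and merges can make the ranges uneven):

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key

def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard index, assuming shards evenly
    split the 128-bit hash key range."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // HASH_KEY_SPACE
```

The same key always hashes to the same shard, which is what preserves per-key ordering for consumers.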
17. Ad Network Logging Top-10 Detail – KCL
[Diagram: foo-analysis.com sends data records into a Kinesis stream; each data record is addressed by shard and sequence number (e.g. 14, 17, 18, 21, 23). KCL workers in a Kinesis application on Elastic Beanstalk consume the shards, each computing a "my top-10" that is merged into a global top-10.]
18. Ad Network Logging Top-10 Detail – Lambda
[Diagram: the same pipeline with Lambda workers in place of the KCL: foo-analysis.com sends data records into a Kinesis stream (shard and sequence numbers 14, 17, 18, 21, 23); Lambda workers consume them, and per-worker "my top-10" results on Elastic Beanstalk are merged into a global top-10.]
20. Kinesis Client Library (KCL)
• Distributed, to handle multiple shards
• Fault tolerant
• Elastically adjusts to shard count
• Helps with distributed processing
• Develop in Java, Python, Ruby, Node.js, or .NET
21. KCL Design Components
• Worker: the processing unit that maps to each application instance
• Record processor: the processing unit that processes data from a shard of a Kinesis stream
• Check-pointer: keeps track of the records that have already been processed in a given shard
• The KCL restarts processing of a shard at the last known processed record if a worker fails
22. Processing with the Kinesis Client Library
• Connects to the stream and enumerates the shards
• Instantiates a record processor for every shard it manages
• Checkpoints processed records in Amazon DynamoDB
• Balances shard-worker associations when the worker instance count changes
• Balances shard-worker associations when shards are split or merged
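The checkpoint-and-restart behaviour described above can be simulated without the KCL. The class and method names below are illustrative, not the real KCL interface, and the checkpoint table is a plain dict standing in for the DynamoDB table the KCL actually uses:

```python
class CheckpointedShardProcessor:
    """Toy model of KCL-style checkpointing: a shared table stores the
    last processed sequence number per shard, and a restarted worker
    resumes just after it."""

    def __init__(self, checkpoint_table):
        self.checkpoints = checkpoint_table  # shard_id -> sequence number
        self.processed = []

    def process(self, shard_id, records, fail_after=None):
        last = self.checkpoints.get(shard_id, -1)
        for n, (seq, data) in enumerate(records):
            if seq <= last:
                continue  # already processed before the restart
            if fail_after is not None and n >= fail_after:
                raise RuntimeError("simulated worker crash")
            self.processed.append(data)
            self.checkpoints[shard_id] = seq  # checkpoint after each record

# A worker crashes mid-shard; a replacement resumes from the checkpoint.
table = {}
records = [(i, f"rec-{i}") for i in range(5)]
worker1 = CheckpointedShardProcessor(table)
try:
    worker1.process("shard-0", records, fail_after=3)
except RuntimeError:
    pass
worker2 = CheckpointedShardProcessor(table)
worker2.process("shard-0", records)
```

Note that a record processed but re-delivered around a crash would be seen twice, which is why the best-practices slide below stresses duplicate-resilient, idempotent processing.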
23. Best practices for KCL applications
• Leverage EC2 Auto Scaling groups for your KCL application
• Move data from an Amazon Kinesis stream to S3 for long-term persistence
– Use either Firehose or build an "archiver" consumer application
• Leverage durable storage like DynamoDB or S3 for processed data prior to check-pointing
• Duplicates: ensure the authoritative data repository is resilient to duplicates
• Idempotent processing: build a deterministic/repeatable system that achieves idempotent processing through check-pointing
24. Amazon Kinesis connector application
Connector pipeline: Incoming records → Transformed → Filtered → Buffered → Emitted → Outgoing to endpoints
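The four-stage pipeline can be sketched as plain functions composed in order. The stage names mirror the slide; the real connector library defines Java interfaces (transformer, filter, buffer, emitter) for each stage:

```python
def run_pipeline(records, transform, keep, buffer_size, emit):
    """Transform -> filter -> buffer -> emit, flushing whenever the
    buffer reaches buffer_size records."""
    buffered, emitted_batches = [], []
    for record in records:
        record = transform(record)
        if not keep(record):
            continue  # filtered out
        buffered.append(record)
        if len(buffered) >= buffer_size:
            emitted_batches.append(emit(buffered))
            buffered = []
    if buffered:  # flush the final partial batch to the endpoint
        emitted_batches.append(emit(buffered))
    return emitted_batches

# Example: upper-case records, drop empties, emit in batches of 2.
batches = run_pipeline(
    ["a", "", "b", "c"],
    transform=str.upper,
    keep=bool,
    buffer_size=2,
    emit=list,
)
```

Buffering is what makes the S3 and Redshift connectors below efficient: both prefer fewer, larger writes over per-record calls.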
25. Amazon Kinesis Connectors
• Amazon S3
– Batch-write files for archive into S3
– Sequence-based file naming
• Amazon Redshift
– Micro-batch load into Redshift with manifest support
– User-defined message transformers
• Amazon DynamoDB
– BatchPut append to a table
– User-defined message transformers
• Elasticsearch
– Put to an Elasticsearch cluster
– User-defined message transformers
[Diagram: Kinesis feeding S3, DynamoDB, and Redshift.]
27. Event-Driven Compute in the Cloud
• Lambda functions: stateless, request-driven code execution
– Triggered by events in other services:
• PUT to an Amazon S3 bucket
• Write to an Amazon DynamoDB table
• Record in an Amazon Kinesis stream
– Makes it easy to…
• Transform data as it reaches the cloud
• Perform data-driven auditing, analysis, and notification
• Kick off workflows
28. No Infrastructure to Manage
• Focus on business logic, not infrastructure
• Upload your code; AWS Lambda handles:
– Capacity
– Scaling
– Deployment
– Monitoring
– Logging
– Web service front end
– Security patching
29. Automatic Scaling
• Lambda scales to match the event rate
• Don't worry about over- or under-provisioning
• Pay only for what you use
• New app or successful app, Lambda matches your scale
30. Bring your own code
• Create threads and processes, run batch scripts or other executables, and read/write files in /tmp
• Include any library with your Lambda function code, even native libraries
31. Fine-grained pricing
• Buy compute time in 100 ms increments
• Low request charge
• No hourly, daily, or monthly minimums
• No per-device fees
• Never pay for idle
Free tier: 1M requests and 400,000 GB-s of compute every month, for every customer
32. Data Triggers: Amazon S3
[Diagram: (1) a satellite image is PUT to an Amazon S3 bucket, (2) the bucket event invokes AWS Lambda, (3) Lambda produces a thumbnail of the satellite image.]
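A minimal Lambda handler for the S3 trigger above might look like the following; the thumbnailing itself is elided, and this sketch just unpacks the standard S3 event shape (the bucket and key names are made up for the example):

```python
def lambda_handler(event, context):
    """Handle an S3 PUT event: pull out each new object's bucket and
    key. A real function would fetch the image here, resize it, and
    write the thumbnail back to another bucket."""
    results = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        results.append((bucket, key))
    return results

# Locally, the handler can be exercised with a sample event document:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "landsat-images"},
                "object": {"key": "scene-001.tif"}}}
    ]
}
result = lambda_handler(sample_event, None)
```

Testing the handler against a hand-built event like this is a common way to iterate before wiring up the real bucket notification.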
33. Data Triggers: Amazon DynamoDB
[Diagram: an Amazon DynamoDB table and its stream trigger AWS Lambda, which sends Amazon SNS notifications and updates another table.]
34. Calling Lambda Functions
• Call from mobile or web apps
– Wait for a response, or send an event and continue
– AWS SDK, AWS Mobile SDK, REST API, CLI
• Send events from Amazon S3 or SNS:
– One event per Lambda invocation, 3 attempts
• Process DynamoDB changes or Amazon Kinesis records as events:
– Ordered model with multiple records per event
– Unlimited retries (until data expires)
35. Writing Lambda Functions
• The basics
– Stock Node.js, Java, Python
– The AWS SDK comes built in and ready to use
– Lambda handles inbound traffic
• Stateless
– Use S3, DynamoDB, or other Internet storage for persistent data
– Don't expect affinity to the infrastructure (you can't "log in to the box")
• Familiar
– Use processes, threads, /tmp, sockets, …
– Bring your own libraries, even native ones
36. How can you use these features?
• "I want to send customized messages to different users": SNS + Lambda
• "I want to send an offer when a user runs out of lives in my game": Amazon Cognito + Lambda + SNS
• "I want to transform the records in a click stream or an IoT data stream": Amazon Kinesis + Lambda
38. Amazon EMR integration
Read data directly into Hive, Pig, Streaming, and Cascading
• Real-time sources into batch-oriented systems
• Multi-application support and check-pointing
39. Amazon EMR integration: Hive
CREATE TABLE call_data_records (
  start_time bigint,
  end_time bigint,
  phone_number STRING,
  carrier STRING,
  recorded_duration bigint,
  calculated_duration bigint,
  lat double,
  long double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED BY
'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="MyTestStream");
41. Spark Streaming – Basic concepts
• Higher-level abstraction called Discretized Streams, or DStreams
• Represented as a sequence of Resilient Distributed Datasets (RDDs)
[Diagram: a receiver turns incoming messages into a DStream, materialized as RDD@T1, RDD@T2, …]
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
42. Apache Spark Streaming
• Window-based transformations
– countByWindow, countByValueAndWindow, etc.
• Scalability
– Partition the input stream
– Each receiver can run on a separate worker
• Fault tolerance
– Write-ahead log (WAL) support for streaming
– Stateful exactly-once semantics
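A window-based transformation amounts to evaluating an aggregate over the last N micro-batches each time the window slides. A plain-Python sketch of countByWindow's shape (not the Spark API itself, which operates on DStreams):

```python
from collections import deque

def count_by_window(batches, window_len, slide):
    """Emit the total record count over a sliding window of
    `window_len` micro-batches, advancing `slide` batches at a time --
    the same shape as Spark Streaming's countByWindow."""
    window = deque(maxlen=window_len)  # oldest batch drops off automatically
    counts = []
    for i, batch in enumerate(batches):
        window.append(len(batch))
        if (i + 1) % slide == 0:       # the window slides: emit a count
            counts.append(sum(window))
    return counts

# Four micro-batches, a window of 3 batches, sliding every 2 batches.
counts = count_by_window([[1, 2], [3], [4, 5, 6], [7]],
                         window_len=3, slide=2)
```

In Spark both the window length and the slide interval are given as durations that must be multiples of the batch interval; here they are expressed directly in batches for clarity.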
48. You're likely already "streaming"
• Embrace "stream thinking"
• Event-processing tools are available that will help increase your solutions' functionality, availability, and durability