The slides from my talk at the AWS DevDays in the Nordics.
https://aws.amazon.com/events/Devdays-Nordics/agenda/
Objectives:
- Understand Serverless Key Concepts.
- Understand Event Processing Architecture.
- Understand Operation Automation Architecture.
- Understand Web Application Architecture.
- Understand Data Processing Architecture.
* Kinesis-based apps.
* IoT-based apps.
Also a classic in Data Lake designs.
Amazon S3 is a simple key-based object store whose scalability and low cost make it ideal for storing large datasets.
S3 provides excellent performance for storing and retrieving objects based on a known key.
You can take advantage of AWS Lambda's event-driven triggers from S3.
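As a minimal sketch of that trigger pattern, the handler below parses the documented S3 event notification payload that Lambda receives. The bucket/key extraction follows the published event shape; what you do with each object is your own logic.

```python
# Minimal sketch of an AWS Lambda handler invoked by an S3 event.
# The event follows the documented S3 notification format.
import urllib.parse

def handler(event, context):
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification payload.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append((bucket, key))
    return results
```

In a real function you would replace the `results` list with a call into your processing logic (copying, transforming, or indexing the object).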
S3 storage tiers and lifecycle policies
Modern businesses typically collect data from internal and external sources at various frequencies throughout the day. These data sources could be franchise stores, subsidiaries, or new systems integrated as a result of merger and acquisitions.
For example, a retail chain might collect point-of-sale (POS) data from all franchise stores three times a day to get insights into sales as well as to identify the right number of staff at a given time in any given store. As each franchise functions as an independent business, the format and structure of the data might not be consistent across the board. Depending on the geographical region, each franchise would provide data at a different frequency and the analysis of these datasets should wait until all the required data is provided (event-driven) from the individual franchises. In most cases, the individual data volumes received from each franchise are usually small but the velocity of the data being generated and the collective volume can be challenging to manage.
Narrative: You need a different set of analytical tools to collect and analyze real-time streaming data than what you have traditionally used for data at rest. With traditional analytics, you gather the information, store it in a database, and analyze it hours, days, or weeks later. Analyzing real-time data requires a different approach. Instead of running database queries on stored data, streaming analytics platforms have to process the data continuously and before the data lands in a database. And streaming data comes in at an incredible rate that can vary up and down all the time. Streaming analytics platforms have to be able to process this data when it arrives, often at speeds of millions and even tens of millions of events per hour.
Key requirements of stream processing
Durable: durable ingest so that processing can be repeatable
Continuous: need to always be processing the latest data
Fast: frequency (micro-batches, size of batches, true streaming) and speed (sub-second, minute, hour)
Correct: at-most-once, at-least-once, and exactly-once processing; event time, ingest time, processing time
Reactive: ability to process and respond in near real time; feedback mechanisms to send processed data to live applications
Reliable: highly available, fast failovers
A foundation of highly durable data storage and streaming of any type of data
A metadata index and workflow which helps us categorise and govern data stored in the data lake
A search index and workflow which enables data discovery
A robust set of security controls: governance through technology, not policy
An API and user interface that expose these features to internal and external users
All three storage classes are highly durable, designed for 11 9s of durability.
Standard is designed for active data and hot workloads: high performance, designed for 4 9s of availability. It is the general-purpose storage class; since new data is frequently accessed, start with Standard. Starting at 2.3¢/GB.
As data ages, it is accessed less. Standard-IA is designed for colder, less frequently accessed data. It offers the same high throughput and low latency as S3 Standard, with 3 9s of availability. Starting at 1.25¢/GB, about 45% lower in storage cost, with a $0.01/GB retrieval charge.
As data ages further, no one actively interacts with it, but you still need it for record keeping. Glacier is designed for long-term archival storage at 4/10ths of a cent per GB, with three retrieval options ranging from minutes to hours; you choose depending on how quickly you need the data.
lifecycle
For data that is less frequently accessed, you can leverage Amazon S3 Standard-IA to save on cost while still benefiting from the same durability and performance as S3 Standard.
In addition to transitioning your data to Standard-IA as its characteristics change, you can also leverage Standard-IA for new data that fits the bill for infrequent access. For example, you can use the Standard-IA storage class to store detailed application logs that you analyze infrequently and save on storage cost.
If your storage is for big data analytics, content distribution, or websites, consider S3 Standard, which is designed for active, hot analytics workloads. Netflix and FINRA are examples.
For backup, archive, and DR, you generally don't access the data until the rare occasion when you need it, but when you do, you need it right away. Standard-IA is designed for those use cases; consider putting such data directly into Standard-IA.
Glacier long term archive. Sony moved over 1M hours of video from magnetic tape to Glacier for digital preservation
As data ages, you access the objects less.
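The tiering described above can be automated with an S3 lifecycle configuration. Below is a hedged sketch of one that transitions objects to Standard-IA after 30 days and Glacier after 90; the bucket name, prefix, and day thresholds are illustrative assumptions, not values from the talk.

```python
# Sketch of an S3 lifecycle configuration matching the storage-class
# tiering described above. Prefix and day counts are assumptions.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-down-aging-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With boto3 you would apply it like this (bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",
#     LifecycleConfiguration=lifecycle_configuration,
# )
```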
Since Amazon Kinesis launched in 2013, the ecosystem has evolved and we have introduced Kinesis Firehose and Kinesis Analytics.
Streams was launched in GA at re:Invent 2014, Firehose at re:Invent 2015, and Analytics was launched in August 2016
We have continuously iterated to make it easier for customers to use streaming data, as well as expand the functionality of real-time processing
Together, these three products make up the Amazon Kinesis streaming data platform
A shard is a uniquely identified sequence of data records in a stream. When you create a stream, you specify the number of shards for the stream.
Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The total capacity of a stream is the sum of the capacities of its shards. You can increase or decrease the number of shards in a stream as needed. However, note that you are charged on a per-shard basis.
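The per-shard limits above lead to a simple sizing calculation: take the largest shard count implied by your write throughput, write record rate, and read throughput. A back-of-the-envelope helper, using the published limits (1 MB/s or 1,000 records/s in, 2 MB/s out per shard):

```python
# Back-of-the-envelope Kinesis shard sizing from the per-shard limits
# quoted above: 1 MB/s write, 1,000 records/s write, 2 MB/s read.
import math

def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    return max(
        math.ceil(write_mb_per_s / 1.0),   # write throughput limit
        math.ceil(records_per_s / 1000.0), # write record-rate limit
        math.ceil(read_mb_per_s / 2.0),    # read throughput limit
    )
```

For example, a workload writing 3.5 MB/s across 2,500 records/s and reading 5 MB/s needs four shards, because the write throughput is the binding constraint. Remember you are charged per shard.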
Tumbling:
Repeat at fixed, regular intervals
Are non-overlapping
An event can belong to only one tumbling window
Sliding:
Continuously moves forward in time
Produces an output only during the occurrence of an event
Every window will have at least one event
Events can belong to more than one sliding window
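The contrast in the two lists above can be made concrete with a small window-assignment sketch. This is an illustration of the semantics, not Kinesis Analytics code: a 10-second tumbling window places each event timestamp in exactly one window, while a 10-second window sliding every 5 seconds can place the same event in more than one. Window size and slide are assumed values.

```python
# Illustrative window assignment: which window(s) does an event at
# timestamp `ts` (in seconds) belong to?

def tumbling_window(ts, size=10):
    # Non-overlapping: every event falls in exactly one window.
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size=10, slide=5):
    # Overlapping: an event can belong to several windows.
    first = ((ts - size) // slide + 1) * slide
    return [
        (s, s + size)
        for s in range(max(0, first), ts + 1, slide)
        if s <= ts < s + size
    ]
```

An event at t=12s falls in the single tumbling window [10, 20) but in both sliding windows [5, 15) and [10, 20).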
AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information. AWS Lambda starts running your code within milliseconds of an event such as an image upload, in-app activity, website click, or output from a connected device. You can also use AWS Lambda to create new back-end services where compute resources are automatically triggered based on custom requests. With AWS Lambda you pay only for the requests served and the compute time required to run your code. Billing is metered in increments of 100 milliseconds, making it cost-effective and easy to scale automatically from a few requests per day to thousands per second.
Sometimes, though, you may want to analyze the anomalies or at least be notified of their presence.
Such an algorithm could be: calculating a moving average and standard deviation of the time-series data.
Data is sent from our sensor to AWS IoT, where it is routed to AWS Lambda through the AWS IoT Rules Engine. Lambda executes the anomaly-detection logic and, because the algorithm requires knowledge of previous measurements, uses Amazon DynamoDB as a key-value store. The Lambda function republishes the message along with parameters extracted from the PEWMA algorithm. The results can be viewed in your browser through a WebSocket connection to AWS IoT on your local machine.
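As a simplified stand-in for the PEWMA idea (not the full probabilistic weighting), the sketch below keeps an exponentially weighted moving average and variance, and flags a reading that falls more than a threshold number of standard deviations from the running mean. The `alpha` and `threshold` values are illustrative assumptions; in the architecture above, the mean and variance would be persisted in DynamoDB between Lambda invocations.

```python
# Simplified EWMA-based anomaly detection, a stand-in for PEWMA.
# alpha and threshold are illustrative assumptions.
import math

def ewma_anomaly(values, alpha=0.3, threshold=3.0):
    mean, var = values[0], 0.0
    flags = [False]  # first reading seeds the state
    for x in values[1:]:
        std = math.sqrt(var)
        # Flag readings far from the running mean (needs a nonzero std).
        flags.append(std > 0 and abs(x - mean) > threshold * std)
        # Update the running mean and variance exponentially.
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flags
```

For a series of readings hovering near 10, a sudden jump to 50 is flagged while the ordinary fluctuations are not.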
To build that, no genius was involved, but like Edison, we stitched innovations together to create a wow experience for our customers (secure, scalable, reliable...).
And this is exactly what the cloud is about!
Just a few years ago, it would have taken months or even years to build a scalable, reliable, and secure application like that from scratch.