We all want to analyse and visualise streaming data for real-time operational insights into our applications and infrastructure, and to make more informed decisions. But streaming analytics are hard at scale. Before you realize it, you can end up with a very sophisticated architecture with many moving parts that you need to secure, monitor, and scale independently. In this session, you will learn how to work with real-time data, from ingestion to visualisation and monitoring, at any scale, by leveraging the managed services provided by AWS.
We are trusted by companies for which cyber security is absolutely critical:
• 5/5 top UK banks
• 3/5 top US banks
• 3/5 top Singapore banks
• 4/5 top South African banks
• 5/5 top Nordic banks
Endpoint protection → new cyber security solutions
F-Secure:
• Founded in 1988
• 1,600+ employees
• Listed on NASDAQ OMX Helsinki
• ~30 offices around the globe
• Revenue of €190 million in 2018
• 100,000+ corporate customers and tens of millions of consumer customers
DETECT ATTACKS IN MINUTES WITHOUT DROWNING IN ALERTS
Detection funnel (average numbers from a customer organization with ~1,300 endpoints):
• 2 billion data events/month – collected by endpoint sensors, network sensors, and decoy sensors
• 900,000 suspicious events – real-time behavioral analysis of the raw data events, supported by AI and machine learning (training set: true/false positive decisions by the hunters)
• 25 detections – detections of which the customer was notified, after threat hunters have analyzed the machine-filtered detections
• 15 real threats – the customer confirmed that these were real threats
Analysis pipeline: event enrichment → host & user profiling → anomaly detection → detection significance analysis
4 minutes (for slides 22 and 23)
Hopefully the value of data streaming is very clear at this stage. However, it is important to note that companies face many challenges as they attempt to build out real-time data streaming capabilities and embark on generating real-time analytics.
Data streams are difficult to set up, tricky to scale, hard to make highly available, complex to integrate into broader ecosystems, error-prone and complex to manage over time, and can become very expensive to maintain. These challenges have often been reason enough for many companies to shy away from such projects.
At AWS, our core focus over the last five years has been to build a solution that removes these challenges.
The AWS solution is easy to set up and use; offers high availability and durability (data is replicated across three Availability Zones by default); is fully managed and scalable, reducing the complexity of managing the system over time and of scaling as demand increases; and comes with seamless integration into other core AWS services such as Elasticsearch for log analytics, S3 for data lake storage, Redshift for data warehousing, Lambda for serverless processing, and so on. Finally, with AWS you only pay for what you use, making the solution very cost effective.
Purpose of the slide – breaks out the top 6 benefits of the service.
Supports Open-Source APIs and Tools - We provide an open-source-compatible version of Elasticsearch. If you are currently running self-managed Elasticsearch, you can easily migrate it to the service. We take care of managing the cluster (the undifferentiated heavy lifting), and you can continue to use the same open-source tools and APIs you are already using.
Easy to Use - You use the console, SDK, or CLI to easily create a cluster; we then do the work of deploying the cluster and making it available via an endpoint that you can access through a REST API (see the sketch after this list).
Scalable - We make it very easy to scale your clusters. With just a few commands we will seamlessly deploy a new cluster for you, and you can continue to run uninterrupted.
Secure - We provide a number of different security options. You can use IAM and VPC to secure access to your cluster.
Highly Available - We provide 100% data redundancy across two Availability Zones.
Tightly Integrated with Other AWS Services - On the ingest side you can easily send CloudWatch Logs to Amazon ES, you can use Kinesis Data Firehose to stream data to Amazon ES, and we also offer integration with AWS IoT. For cluster creation, CloudFormation also supports Amazon ES.
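For illustration, a minimal boto3 sketch of creating a domain programmatically (the domain name, version, instance type, and sizes below are placeholder assumptions, not recommendations):

    import boto3

    es = boto3.client('es')  # Amazon Elasticsearch Service control plane

    # Create a small two-node, zone-aware domain; once the domain is
    # active, its endpoint can be used with standard REST tooling.
    response = es.create_elasticsearch_domain(
        DomainName='demo-logs',  # hypothetical name
        ElasticsearchVersion='6.3',
        ElasticsearchClusterConfig={
            'InstanceType': 'm4.large.elasticsearch',
            'InstanceCount': 2,
            'ZoneAwarenessEnabled': True,  # redundancy across two AZs
        },
        EBSOptions={'EBSEnabled': True, 'VolumeType': 'gp2', 'VolumeSize': 20},
    )
    print(response['DomainStatus']['ARN'])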
Purpose of the slide – Gives them the confidence that they are not alone if they use Amazon ES regardless of their vertical.
Key takeaway is that Amazon Elasticsearch Service usage is not isolated to a few verticals, or high-tech companies. Almost all enterprises today are using some form of log analytics and operational monitoring to ensure the success of their business.
4 minutes
So finally – I would like to introduce the AWS services that we have built to enable real-time analytics for our customers. The Kinesis family consists of three core services for data streaming (note we also have a fourth service, Amazon Kinesis Video Streams, which enables customers to stream and analyze video and audio in real time; although we are not covering it today, it is a very exciting capability).
Kinesis Data Streams enables customers to capture and store data
Kinesis Data Analytics allows customers to build real-time applications in SQL or Java (with fully-managed Flink)
And Kinesis Data Firehose enables customers to load streaming data into data stores, data lakes, and data warehouses, and is a very effective way of performing ETL on continuous, high-velocity data. We will go into the details of these services tomorrow during Damian Wylie's session.
Finally, we are very excited to highlight the latest service, announced at re:Invent 2018 and currently in public preview, which has already achieved a run rate of $5 million. Amazon Managed Streaming for Kafka (Amazon MSK) is a fully managed service for Apache Kafka, a highly popular open-source framework for data streaming. Customers who choose Kafka currently manage clusters either on premises or on EC2, with many of the challenges we spoke about earlier. With Amazon MSK, customers can now lift and shift their existing workloads and get the full benefits of a fully managed service, where clusters are set up automatically and can be created or torn down on demand. This is a very exciting opportunity this year, and if you hear of any customers who use Apache Kafka, do mention Amazon MSK and convince them to give it a go.
Another huge advantage of these four services is that they give our customers the flexibility to choose the right streaming technology depending on their use case, needs, and preferences. Damian will discuss this in depth tomorrow, but we are certainly excited to be able to offer our customers choice in this space.
We often get questions from customers regarding when to use Amazon Kinesis Data Streams or Amazon Kinesis Data Firehose.
Amazon Kinesis Data Streams is for use cases that require custom processing, per incoming record, with sub-1 second processing latency, and a choice of stream processing frameworks
Amazon Kinesis Data Firehose is for use cases that require zero administration, ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon ES, and a data latency of 60 seconds or higher
In many cases customers leverage both services: KDS for real-time event processing, and KDF to load the streaming data into data stores for more thorough analysis.
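To make the difference concrete, a hedged boto3 sketch of writing the same event to each service (the stream and delivery-stream names are hypothetical):

    import json
    import boto3

    kinesis = boto3.client('kinesis')
    firehose = boto3.client('firehose')

    event = json.dumps({'user': 'u-123', 'action': 'click'})

    # KDS: you choose a partition key, and your consumers read and
    # process records themselves with sub-second latency.
    kinesis.put_record(
        StreamName='clickstream',  # hypothetical stream
        Data=event.encode('utf-8'),
        PartitionKey='u-123',
    )

    # KDF: no partition key; Firehose batches records and delivers them
    # to the configured destination (S3, Redshift, Amazon ES) on its own.
    firehose.put_record(
        DeliveryStreamName='clickstream-to-s3',  # hypothetical delivery stream
        Record={'Data': (event + '\n').encode('utf-8')},
    )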
3 minutes
In order to understand the basics of real-time analytics and data streaming, there are five core stages to understand.
Firstly, the source of data – essentially, where is the data coming from? Mobile apps, web clickstreams, application logs, IoT devices, smart devices, and so on.
Data then needs to be ingested into the stream. This requires a solution that can reliably capture data coming from hundreds of thousands of devices into one stream for analysis, and scale as it does so. Damian will dig into some of the details of this tomorrow.
Data is then stored in the order it was received for a set duration of time, and can be replayed repeatedly during this window.
As the data sits in the stream, it can be processed by real-time applications to generate real-time analytics, execute real-time ETL, and deliver the continuous data to an end destination such as a data lake (S3, then analyzed with Athena), a data warehouse (Redshift), or other databases such as DynamoDB.
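To make the store and process stages concrete, a minimal boto3 consumer sketch that reads new records from one shard (the stream name and the single-shard assumption are mine):

    import time
    import boto3

    kinesis = boto3.client('kinesis')

    # Find the first shard of the (hypothetical) stream and start
    # reading only records that arrive from now on.
    stream = kinesis.describe_stream(StreamName='clickstream')
    shard_id = stream['StreamDescription']['Shards'][0]['ShardId']
    iterator = kinesis.get_shard_iterator(
        StreamName='clickstream',
        ShardId=shard_id,
        ShardIteratorType='LATEST',
    )['ShardIterator']

    while True:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch['Records']:
            # Real-time processing / ETL would happen here before results
            # are delivered to S3, Redshift, DynamoDB, etc.
            print(record['Data'])
        iterator = batch['NextShardIterator']
        time.sleep(1)  # stay within the per-shard read throughput limits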
Expected questions:
What is time-based seek?
How exactly does replay work?
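On those two questions: a time-based seek positions a shard iterator at a point in time within the stream's retention window, and replay is simply reading forward from there. Continuing the consumer sketch above (same client and shard_id), a hedged illustration:

    from datetime import datetime, timedelta

    # AT_TIMESTAMP starts the iterator at records added at or after the
    # given time; reading then proceeds with get_records exactly as for
    # live consumption, so the same code replays historical data.
    replay_iterator = kinesis.get_shard_iterator(
        StreamName='clickstream',
        ShardId=shard_id,
        ShardIteratorType='AT_TIMESTAMP',
        Timestamp=datetime.utcnow() - timedelta(hours=6),  # replay last 6 hours
    )['ShardIterator']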
Highlight that we have earned the trust of some of the most demanding industries, such as finance
Explain the multiple reasons why banks focus on security, and what’s driving them to do more today
Explain how broadly we do business with them, and how we can help them
Share the first story here: a notable case from a renowned bank (anonymous, of course), with an incident response & forensics angle.
UNMATCHED NETWORK VISIBILITY BY NETWORK, DECOY & ENDPOINT SENSORS
We have the sensors collecting the relevant data and sending it to our cloud. We apply real-time behavioral analytics and big data analytics to process the data, looking for anomalies from two perspectives: known bad behavior and unknown bad behavior. All anomalies are raised to our experts in the Rapid Detection Center, who further verify them and alert the customer or partner in less than 30 minutes when something critical is discovered. Threat Analysts then walk the customer or partner through the necessary steps to contain and remediate the threat.
Alerts fall into two high-level categories:
High-severity alerts, which are critical, e.g. a strong indication of an ongoing breach. The customer/partner is alerted via phone and email, the case is verified, and the customer's/partner's critical incident response process is initiated when needed.
Medium/low-severity alerts, which are non-critical. Typically these are spy/adware or other potentially unwanted programs discovered on employee PCs.
With more details:
Your organization – what is deployed:
Endpoint sensors: Windows (7 and later, 2008 R2 and later), Mac (macOS 10.11 El Capitan and macOS 10.12 Sierra), Linux (CentOS 6 and 7, RHEL 6 and 7, Debian 7 and 8)
We collect behavioral metadata – not the contents of, for example, document files.
Collected data and privacy considerations are described in the Privacy Policy.
We collect roughly 5 MB of data per typical Windows office user per day; use this to calculate the network impact (e.g. an organization with ~1,300 endpoints generates about 6.5 GB/day, roughly 0.6 Mbit/s averaged over 24 hours).
We are constantly working to reduce the amount of data collected
Honeypots (decoy sensors) are a very low-noise way to build traps for an attacker. Once someone accesses a honeypot, the access is immediately correlated with information from the other sensors to filter out false alarms. If there is a clear pattern suggesting malicious behavior, the customer is alerted.
Honeypots are built on top of Linux and include the necessary components to mimic critical assets. We provide several predefined flavors of honeypot, based on popular setups:
*NIX web server (HTTP, HTTPS, MySQL, SSH)
*NIX ftp (FTP, SSH)
*NIX VoIP server (SIP, SSH)
Windows server (SMB, MSSQL, TFTP)
Windows workstation (SMB services)
It is possible to configure a honeypot with any combination of the services mentioned above.
Threat intelligence – Global and industry specific
All threat intelligence (internal and external) has been connected and implemented into the core of RDS, which is an AI-assisted threat hunting platform. RDS has been built as a natively global, high-performance, low-latency cloud service. The following describes at a high level how it works:
When a sensor is installed, it looks for signs of compromise and then starts collecting relevant behavioral data
Data is sent from the sensors and received by data ingestion front ends (distributed globally)
Received data goes through a very low latency enrichment process, where additional information, for example file/URL reputation, is added to the events
After enrichment, events go through a real-time detection engine that looks for anomalies based on observed behavior; this is done at multiple levels, with data correlation where needed
If a detection is triggered, a baseliner algorithm (machine learning based) is run to filter out potential false positives
If the baseliner's verdict is malicious, the detection is raised to the RDC to be taken further by RDS threat analysts
Depending on the length of the observed behavior, steps 1 to 3 typically take from under a minute to a few minutes.
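Purely as an illustration of the enrich-detect-filter pattern described above (a toy Python sketch, not F-Secure's actual code; the reputation feed, field names, and baseline logic are all invented):

    # Toy sketch of an enrich -> detect -> baseline-filter step.
    REPUTATION = {'evil.example': 'malicious'}  # invented reputation feed

    def enrich(event):
        # Enrichment: attach e.g. URL reputation to the raw event.
        event['url_reputation'] = REPUTATION.get(event.get('url_host'), 'unknown')
        return event

    def anomalous(event, baseline):
        # Anomaly check: flag host/process pairs never observed before.
        key = (event['host'], event['process'])
        first_seen = key not in baseline
        baseline.add(key)
        return first_seen

    baseline = set()
    event = enrich({'host': 'ws-042', 'process': 'powershell.exe',
                    'url_host': 'evil.example'})
    if anomalous(event, baseline) and event['url_reputation'] == 'malicious':
        print('raise detection to RDC analysts:', event)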
We can utilize industry/customer specific detection algorithms depending on the need
Once the data has been analyzed by real-time detection, it goes into big data storage (where possible we always use pseudonymized data)
We utilize stored data for threat hunting in the following ways:
Data is analyzed automatically by various algorithms (for example statistical analytics and machine learning) to find new anomalies, which are then further analyzed and correlated either by other algorithms or by the Threat Analysts. We use both organization-specific and global analytics when applicable.
We conduct threat hunting driven by data science, where new algorithms are tested against sets of data to discover previously unknown threats. The insights gained are transferred into new detection algorithms and new competences for Threat Analysts.
We also use the data to improve our false positive rates and to improve performance, e.g. by collecting less data from the sensors
Rapid Detection Center
At the core of RDS are the cyber security experts. We have three types of skills available:
Threat Analysts (24/7) act as the first level. They constantly monitor the service and hunt for threats. Once they get an indication that something suspicious is happening, they first verify the case by collecting the necessary evidence, then decide on the priority; if the priority is high, the customer/partner is alerted immediately with actionable intelligence. If the case is non-critical, it is written up with remediation guidance and sent to the customer/partner. Threat Analysts also keep the customer/partner up to date on any ongoing investigations.
Incident Responders (24/7). IR personnel are typically involved in complex cases where the customer/partner is not able to manage the case internally. Incident Responders help the customer remotely or on site to contain and remediate the case, and with evidence gathering for legal purposes. We offer both experienced case leaders and technical incident responders. We have also worked together with law enforcement and know how to collect evidence that can be used in court. RDS has been designed so that it can be deployed during an IR case as a threat hunting service, to quickly gain visibility when the customer network has already been breached.
Forensic experts. We are one of very few organizations globally who can handle a very wide range of forensic tasks, ranging from internal networks to deep reverse engineering of unique malware samples. This allows us to handle even the most complicated nation-state attacks and the investigations that follow a breach attempt.
F-Secure first applied machine learning over 10 years ago (2008) in a malware detection engine called Hydra. Other client-side components followed, including DeepGuard, BlackLight, and Gemini, all leveraging a machine-learning-based malware detection engine used in conjunction with client-side behavioral analysis logic. In 2017 F-Secure launched its AI Center of Excellence, which currently applies techniques such as reinforcement learning, GANs, and federated learning.
F-Secure is also an active member of the pan-European SHERPA project, which works to better understand adversarial attacks against machine learning and potential malicious uses of machine learning.
Note: A technical service manager (TSM) is mandatory whenever the reseller partner is not responsible for deployments and first-line support.
The TSM supports customer deployments, ensuring settings are configured correctly, and supports the customer in production together with customer care.
The TSM monitors service levels (product, support), drives feature adoption, and acts as a local escalation point.
This is the Processing Section
Are KDS and KDF the only streams that KDA can work with (with MSK on the roadmap)? Can the output be sent to any of the consumers on slide 17?
Would KDA ever be replaced by another consumer completely? If so, why, and for which use cases? What is the standard architecture here: KDS/KDF -> KDA -> Lambda/ES/EMR -> S3/Redshift/DynamoDB? If so, we should talk about multiple consumers working in a workflow to execute effectively across many use cases.
Is this specific to Java? Why is it in this section? What are the points we are making here? Two processes, one for S3 and one for DynamoDB? keyBy, window, and filter need explanations.
RabbitMQ? Are these the only three sources? What is the message here?
Destinations could be a stream/messaging queue, a data lake/warehouse/database or an analytical service such as Elasticsearch. Anything else? (again is this specific to the Java application?)
We support the majority of the ANSI SQL:2011 standard.
Some customers don’t have normalized or easy to structure data in their streams. These capabilities provide mechanisms to transform data ahead of SQL code (pre-processing).
Some customers don’t have normalized or easy to structure data in their streams. These capabilities provide mechanisms to transform data ahead of SQL code (pre-processing).
Proofs of concept typically take less than a day.
We have a rich amount of content on the website, including case studies, webinars, many blog posts, and technical documentation.
Thank you for listening and I would now like to hand over to Ajit to provide some more insight into sales opportunities.