During a Big Data Warehousing Meetup in NYC, Elliott Cordo, Chief Architect at Caserta Concepts, discussed emerging trends in real-time data processing. The presentation covered processing frameworks such as Spark and Storm, as well as datastore technologies ranging from NoSQL to Hadoop. He also discussed exciting new AWS services such as Lambda, Kinesis, and Kinesis Firehose.
2. @CasertaConcepts
About Caserta Concepts
• Consulting firm focused on Data Innovation and Modern Data Engineering to solve
highly complex business data challenges
• Award-winning company
• Internationally recognized work force
• Mentoring, Training, Knowledge Transfer
• Strategy, Architecture, Implementation
• Innovation Partner
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Leader in architecting and implementing enterprise data solutions
• Data Warehousing
• Business Intelligence
• Big Data Analytics
• Data Science
• Data on the Cloud
• Data Interaction & Visualization
• Strategic Consulting
• Technical Design
• Build & Deploy Solutions
6. @CasertaConcepts
Come out and Play
CIL - Caserta
Innovations Lab
Experience
Big Data Warehousing Meetup
• Established in 2012 in NYC
• Meet monthly to share data best
practices and experiences
• 3,300+ Members
http://www.meetup.com/Big-Data-Warehousing/
Examples of Previous Topics
• Data Governance, Compliance &
Security in Hadoop w/Cloudera
• Real Time Trade Data Monitoring
with Storm & Cassandra
• Predictive Analytics
• Exploring Big Data Analytics
Techniques w/Datameer
• Using a Graph DB for MDM &
Relationship Mgmt
• Data Science w/Claudia
Perlich & Revolution Analytics
• Processing 1.4 Trillion Events
in Hadoop
• Building a Relevance Engine
using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed
ETL & SQL w/Hadoop
• Intro to NoSQL w/10gen
8. @CasertaConcepts
What is real-time?
• Latency between data creation and analytics?
• Is it the speed with which we can retrieve the answer?
In most cases, it’s both.
9. @CasertaConcepts
So, how real time?
How do we measure:
• 1 Hour?
• 5 Minutes?
• Seconds?
• Microseconds?
For all practical purposes:
• As fast as possible
• Fast enough to deliver the required insights
• “Near-Real-Time”
10. @CasertaConcepts
Real time
Two main methods:
• Micro-batch: “traditional” ETL, just faster
• Event-based: events are “pushed” or “pulled”
through a pipeline
11. @CasertaConcepts
Microbatch
• Traditional batch ETL concepts
• Identify and accrue a batch of data that needs to be processed
• Batch Control: where did I last leave off?
• CDC (Change Data Capture): what changed?
• Process all accrued data in a single batch
Rinse and Repeat
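The accrue-process-repeat cycle above can be sketched in a few lines of Python. This is a toy illustration, not anything from the talk: `fetch_changes` stands in for CDC against a change log, the `(timestamp, payload)` rows and the uppercase transform are invented, and the watermark plays the role of batch control.

```python
def fetch_changes(source, last_watermark):
    """CDC stand-in: return rows created after the last watermark.

    `source` is a hypothetical list of (timestamp, payload) rows; a real
    pipeline would query a change log or an updated_at column instead.
    """
    return [row for row in source if row[0] > last_watermark]

def run_microbatch(source, last_watermark):
    """One micro-batch cycle: accrue all changed rows, process them in a
    single batch, then advance the batch-control watermark
    ("where did I last leave off")."""
    batch = fetch_changes(source, last_watermark)
    processed = [payload.upper() for _, payload in batch]  # toy transform
    new_watermark = max((ts for ts, _ in batch), default=last_watermark)
    return processed, new_watermark

# Rinse and repeat: each cycle picks up only what changed since the last.
source = [(1, "a"), (2, "b"), (3, "c")]
out1, wm = run_microbatch(source, last_watermark=0)   # processes all three
source.append((4, "d"))
out2, wm = run_microbatch(source, last_watermark=wm)  # processes only "d"
```

Replaying a failed batch is just calling `run_microbatch` again with the old watermark, which is why recovery is listed as a microbatch strength.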
12. @CasertaConcepts
Pros and Cons to Microbatch
• Pros:
• Leverage existing batch ETL code
• Data can have a known cutoff window: “Sales as of 10pm”
• Wide array of technologies
• Easy to troubleshoot and debug
• Easy to recover from failures: replay the batch
• Cons:
• Results are not real-time; they are a snapshot “as of” some time prior
• Can be difficult to support increasingly tight SLAs
13. @CasertaConcepts
Technologies for Microbatch
• All the usual suspects:
• Traditional ETL tools
• Hadoop ecosystem: Pig and Hive
• Code: Python, SQL, Scala, etc.
• Apache Spark (batch, streaming*)
• New AWS services: Kinesis Firehose
• Load data to S3 and Redshift directly from a Kinesis stream
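A minimal sketch of pushing records into a Firehose delivery stream (which then delivers to S3 or Redshift) might look like the following. The `boto3` `put_record` call is the real API, but the stream name, the event shape, and the helper names here are assumptions for illustration; running `send_to_firehose` requires AWS credentials and an existing delivery stream.

```python
import json

def encode_record(event: dict) -> bytes:
    # Firehose concatenates records on delivery, so a trailing newline
    # keeps the JSON objects line-delimited for S3 files / Redshift COPY.
    return (json.dumps(event) + "\n").encode("utf-8")

def send_to_firehose(event: dict, stream_name: str = "demo-stream"):
    # boto3 imported lazily; the call only works with AWS credentials
    # and a delivery stream named `stream_name` (hypothetical here).
    import boto3
    firehose = boto3.client("firehose")
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_record(event)},
    )
```

Firehose then buffers these records and flushes them to the destination on a size or time threshold, which is what makes it a natural fit for the microbatch pattern.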
14. @CasertaConcepts
Event-based
• Data is processed as it is ingested, not accrued and processed as a
batch
• As close to real-time as you can get
• Typically the source is a message queue
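A stripped-down sketch of the event-based shape, using Python's in-process `queue.Queue` as a stand-in for the message queue; the sentinel and the trivial handler are inventions for the demo. In production the source would be Kafka, Kinesis, RabbitMQ, or similar, and the loop would run indefinitely.

```python
import queue

SENTINEL = object()  # signals end of stream, for this demo only

def consume(events, handler):
    """Process each event as it is ingested -- no accrual into a batch."""
    results = []
    while True:
        event = events.get()  # blocks until an event arrives
        if event is SENTINEL:
            break
        results.append(handler(event))
    return results

q = queue.Queue()
for e in ({"user": "a"}, {"user": "b"}):
    q.put(e)
q.put(SENTINEL)

handled = consume(q, lambda e: e["user"])
```

Note there is no watermark or batch control here, which is exactly why failure recovery is harder: a crashed consumer must rely on the queue's own acknowledgement/replay semantics rather than rerunning a batch.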
15. @CasertaConcepts
Event-Based Pros and Cons
Pros:
• Near real time processing
Cons:
• Generally more difficult (development and administrative)
• Generally does not eliminate batch ETL
• Typically a different code base than existing batch ETL
• Can be difficult to recover from failure
17. @CasertaConcepts
Lambda Architecture
Speed and Batch Layer
• Batch ETL and Real-time are used together
• Real-time insights from Speed
• Cleanup/correction and advanced calculations performed by Batch
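The batch-plus-speed split can be illustrated with a toy query-time merge. This sketch, with invented page-view counts, is not from the talk: the batch view holds authoritative counts recomputed from the full cleaned history, the speed view holds incremental counts for events the last batch run has not yet absorbed, and a query combines them.

```python
from collections import Counter

def merged_view(batch_view: Counter, speed_view: Counter) -> Counter:
    """Lambda-style query: authoritative batch counts plus the speed
    layer's counts for the window since the last batch run."""
    return batch_view + speed_view

# Batch layer: recomputed from the full, cleaned history (as of last run).
batch_view = Counter({"page_a": 100, "page_b": 40})
# Speed layer: incremental, possibly approximate, counts since that run.
speed_view = Counter({"page_a": 3, "page_c": 1})

view = merged_view(batch_view, speed_view)
```

When the next batch run completes, it absorbs that window and the corresponding speed-layer state is discarded, which is how the batch layer performs the cleanup and correction mentioned above.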
18. @CasertaConcepts
Data Stores
• Microbatch architecture: many options, based on data size
and usage patterns
• Event-based: NoSQL, In-Memory, Search
• Write throughput requirements
• Fast reads
• Simplicity
• But we sacrifice query flexibility:
• Decisions about what metrics are “real-time”
• More ETL
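The flexibility trade-off above can be made concrete with a small sketch. A plain dict stands in for the NoSQL/in-memory store, and the metric keys are invented: the ETL pre-aggregates only the metrics decided in advance, so reads are fast point lookups by key, but any question that was not pre-aggregated requires new ETL rather than an ad-hoc query.

```python
from collections import defaultdict

# Stand-in for a NoSQL / in-memory store: one entry per pre-decided metric key.
store = defaultdict(int)

def ingest(event: dict):
    """Each event updates only the metrics chosen in advance ("more ETL").
    Anything not pre-aggregated here cannot be answered by this store."""
    store["clicks:" + event["page"]] += 1
    store["clicks:total"] += 1

for e in ({"page": "home"}, {"page": "home"}, {"page": "pricing"}):
    ingest(e)

# Fast point reads by key; no ad-hoc filtering or joins.
home_clicks = store["clicks:home"]
```

Asking, say, "clicks by country" against this store is impossible without going back and changing the ingest step, which is the query-flexibility sacrifice the slide describes.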
• Awards & recognition in 2013, 2014, and 2015, a consequence of having built a strong, innovative business
• These demonstrate sustained recognition over recent years, not just many years ago
• 5th of IT in NYC
• Developing the next set of best practices: talking to practitioners, understanding current trends in the marketplace, staying relevant and ahead of the curve
• Creating a sense of community, sharing best practices and past experiences