During a Big Data Warehousing Meetup in NYC, Elliott Cordo, Chief Architect at Caserta Concepts, discussed emerging trends in real-time data processing. The presentation covered processing frameworks such as Spark and Storm, as well as datastore technologies ranging from NoSQL to Hadoop. He also discussed exciting new AWS services such as Lambda, Kinesis, and Kinesis Firehose.
2. @CasertaConcepts
About Caserta Concepts
• Consulting firm focused on Data Innovation and Modern Data Engineering to solve
highly complex business data challenges
• Award-winning company
• Internationally recognized work force
• Mentoring, Training, Knowledge Transfer
• Strategy, Architecture, Implementation
• Innovation Partner
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Leader in architecting and implementing enterprise data solutions
• Data Warehousing
• Business Intelligence
• Big Data Analytics
• Data Science
• Data on the Cloud
• Data Interaction & Visualization
• Strategic Consulting
• Technical Design
• Build & Deploy Solutions
6. @CasertaConcepts
Come out and Play
CIL - Caserta
Innovations Lab
Experience
Big Data Warehousing Meetup
• Established in 2012 in NYC
• Meet monthly to share data best
practices and experiences
• 3,300+ Members
http://www.meetup.com/Big-Data-Warehousing/
Examples of Previous Topics
• Data Governance, Compliance &
Security in Hadoop w/Cloudera
• Real Time Trade Data Monitoring
with Storm & Cassandra
• Predictive Analytics
• Exploring Big Data Analytics
Techniques w/Datameer
• Using a Graph DB for MDM &
Relationship Mgmt
• Data Science w/Claudia
Perlich & Revolution Analytics
• Processing 1.4 Trillion Events
in Hadoop
• Building a Relevance Engine
using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed
ETL & SQL w/Hadoop
• Intro to NoSQL w/10gen
8. @CasertaConcepts
What is real-time?
• Latency between data creation and analytics?
• Is it the speed with which we can retrieve the answer?
In most cases, it’s both.
9. @CasertaConcepts
So, how real time?
How do we measure:
• 1 Hour?
• 5 Minutes?
• Seconds?
• Microseconds?
For all practical purposes:
• As fast as possible
• Fast enough to deliver the required insights
• “Near-Real-Time”
10. @CasertaConcepts
Real time
Two main methods:
• Micro-batch: “traditional” ETL, just faster
• Event-based: events are “pushed” or “pulled”
through a pipeline
11. @CasertaConcepts
Microbatch
• Traditional batch ETL concepts
• Identify and accrue a batch of data that needs to be processed
• Batch Control: where did I last leave off?
• CDC (Change Data Capture): what changed?
• Process all accrued data in a single batch
Rinse and Repeat
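The accrue-process-repeat cycle above can be sketched in a few lines of Python. This is a toy illustration, not anything from the talk: `fetch_changes` stands in for CDC against a change log, the `(timestamp, payload)` rows and the uppercase transform are invented, and the watermark plays the role of batch control.

```python
def fetch_changes(source, last_watermark):
    """CDC stand-in: return rows created after the last watermark.

    `source` is a hypothetical list of (timestamp, payload) rows; a real
    pipeline would query a change log or an updated_at column instead.
    """
    return [row for row in source if row[0] > last_watermark]

def run_microbatch(source, last_watermark):
    """One micro-batch cycle: accrue all changed rows, process them in a
    single batch, then advance the batch-control watermark
    ("where did I last leave off")."""
    batch = fetch_changes(source, last_watermark)
    processed = [payload.upper() for _, payload in batch]  # toy transform
    new_watermark = max((ts for ts, _ in batch), default=last_watermark)
    return processed, new_watermark

# Rinse and repeat: each cycle picks up only what changed since the last.
source = [(1, "a"), (2, "b"), (3, "c")]
out1, wm = run_microbatch(source, last_watermark=0)   # processes all three
source.append((4, "d"))
out2, wm = run_microbatch(source, last_watermark=wm)  # processes only "d"
```

Replaying a failed batch is just calling `run_microbatch` again with the old watermark, which is why recovery is listed as a microbatch strength.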
12. @CasertaConcepts
Pros and Cons to Microbatch
• Pros:
• Leverage existing batch ETL code
• Data can have a known cutoff window: “Sales as of 10pm”
• Wide array of technologies
• Easy to troubleshoot and debug
• Easy to recover from failures: replay the batch
• Cons:
• Results are not real-time; they are a snapshot “as of” some time prior
• Can be difficult to support increasingly tight SLAs
13. @CasertaConcepts
Technologies for Microbatch
• All the usual suspects:
• Traditional ETL tools
• Hadoop ecosystem: Pig and Hive
• Code: Python, SQL, Scala, etc.
• Apache Spark (batch, streaming*)
• New AWS services: Kinesis Firehose
• Load data to S3 and Redshift directly from a Kinesis stream
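A minimal sketch of pushing records into a Firehose delivery stream (which then delivers to S3 or Redshift) might look like the following. The `boto3` `put_record` call is the real API, but the stream name, the event shape, and the helper names here are assumptions for illustration; running `send_to_firehose` requires AWS credentials and an existing delivery stream.

```python
import json

def encode_record(event: dict) -> bytes:
    # Firehose concatenates records on delivery, so a trailing newline
    # keeps the JSON objects line-delimited for S3 files / Redshift COPY.
    return (json.dumps(event) + "\n").encode("utf-8")

def send_to_firehose(event: dict, stream_name: str = "demo-stream"):
    # boto3 imported lazily; the call only works with AWS credentials
    # and a delivery stream named `stream_name` (hypothetical here).
    import boto3
    firehose = boto3.client("firehose")
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_record(event)},
    )
```

Firehose then buffers these records and flushes them to the destination on a size or time threshold, which is what makes it a natural fit for the microbatch pattern.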
14. @CasertaConcepts
Event-based
• Data is processed as it is ingested, not accrued and processed as a
batch
• As close to real-time as you can get
• Typically the source is a message queue
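A stripped-down sketch of the event-based shape, using Python's in-process `queue.Queue` as a stand-in for the message queue; the sentinel and the trivial handler are inventions for the demo. In production the source would be Kafka, Kinesis, RabbitMQ, or similar, and the loop would run indefinitely.

```python
import queue

SENTINEL = object()  # signals end of stream, for this demo only

def consume(events, handler):
    """Process each event as it is ingested -- no accrual into a batch."""
    results = []
    while True:
        event = events.get()  # blocks until an event arrives
        if event is SENTINEL:
            break
        results.append(handler(event))
    return results

q = queue.Queue()
for e in ({"user": "a"}, {"user": "b"}):
    q.put(e)
q.put(SENTINEL)

handled = consume(q, lambda e: e["user"])
```

Note there is no watermark or batch control here, which is exactly why failure recovery is harder: a crashed consumer must rely on the queue's own acknowledgement/replay semantics rather than rerunning a batch.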
15. @CasertaConcepts
Event-Based Pros and Cons
Pros:
• Near real time processing
Cons:
• Generally more difficult (development and administrative)
• Generally does not eliminate batch ETL
• Typically a different code base than existing batch ETL
• Can be difficult to recover from failure
17. @CasertaConcepts
Lambda Architecture
Speed and Batch Layer
• Batch ETL and Real-time are used together
• Real-time insights from Speed
• Cleanup/correction and advanced calculations performed by Batch
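The batch-plus-speed split can be illustrated with a toy query-time merge. This sketch, with invented page-view counts, is not from the talk: the batch view holds authoritative counts recomputed from the full cleaned history, the speed view holds incremental counts for events the last batch run has not yet absorbed, and a query combines them.

```python
from collections import Counter

def merged_view(batch_view: Counter, speed_view: Counter) -> Counter:
    """Lambda-style query: authoritative batch counts plus the speed
    layer's counts for the window since the last batch run."""
    return batch_view + speed_view

# Batch layer: recomputed from the full, cleaned history (as of last run).
batch_view = Counter({"page_a": 100, "page_b": 40})
# Speed layer: incremental, possibly approximate, counts since that run.
speed_view = Counter({"page_a": 3, "page_c": 1})

view = merged_view(batch_view, speed_view)
```

When the next batch run completes, it absorbs that window and the corresponding speed-layer state is discarded, which is how the batch layer performs the cleanup and correction mentioned above.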
18. @CasertaConcepts
Data Stores
• Microbatch architecture: many options, based on data size
and usage patterns
• Event-based: NoSQL, In-Memory, Search
• Write throughput requirements
• Fast reads
• Simplicity
• But we sacrifice query flexibility:
• Decisions about what metrics are “real-time”
• More ETL
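The flexibility trade-off above can be made concrete with a small sketch. A plain dict stands in for the NoSQL/in-memory store, and the metric keys are invented: the ETL pre-aggregates only the metrics decided in advance, so reads are fast point lookups by key, but any question that was not pre-aggregated requires new ETL rather than an ad-hoc query.

```python
from collections import defaultdict

# Stand-in for a NoSQL / in-memory store: one entry per pre-decided metric key.
store = defaultdict(int)

def ingest(event: dict):
    """Each event updates only the metrics chosen in advance ("more ETL").
    Anything not pre-aggregated here cannot be answered by this store."""
    store["clicks:" + event["page"]] += 1
    store["clicks:total"] += 1

for e in ({"page": "home"}, {"page": "home"}, {"page": "pricing"}):
    ingest(e)

# Fast point reads by key; no ad-hoc filtering or joins.
home_clicks = store["clicks:home"]
```

Asking, say, "clicks by country" against this store is impossible without going back and changing the ingest step, which is the query-flexibility sacrifice the slide describes.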
• Awards & recognition in 2013, 2014, and 2015, a consequence of having built a strong, innovative business
• These demonstrate sustained recognition over recent years, not just many years ago
• 5th of IT in NYC
• Developing the next set of best practices: talking to practitioners, understanding current trends in the marketplace, staying relevant and ahead of the curve
• Creating a sense of community, sharing best practices and past experiences