Elliott Cordo, Principal Consultant at Caserta Concepts, delivered a talk on NoSQL data storage architectures at our most recent Big Data Warehousing Meetup: what they are, how they're used and why you can't ignore them in the context of existing enterprise data ecosystems.
For more information, check out our website at http://www.casertaconcepts.com/.
Big Data Warehousing Meetup with Riak
1. About the BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Founded by Caserta Concepts, a DW, BI & Big Data Analytics consultancy
• Next BDW Meetup: September 23
2. About Caserta Concepts
Founded in 2001
• President: Joe Caserta, industry thought leader, consultant, educator and co-author of The Data Warehouse ETL Toolkit (Wiley, 2004)
Industries Served:
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
Focused Expertise:
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems
3. Implementation Expertise & Offerings
• Strategic Roadmap / Assessment / Consulting
• Database
• BI / Visualization / Analytics
• Master Data Management
• Big Data Analytics
• Storm
4. WHY NOSQL MATTERS… FOR ANALYTICS
Elliott Cordo
Principal Consultant, Caserta Concepts
5. NoSQL: So what?
• NoSQL is one of the most exciting movements in BIG DATA.
• NoSQL is changing the way a lot of people think about application development, especially analytic applications.
• Not all data is efficiently stored or processed in a relational DB:
  • High data volumes
  • Data that does not fit, or does not require, the relational model
• We have new tools in our arsenal for processing, storing, and analyzing data with these new challenges.
6. But we love SQL
• Relational databases still have their place:
  • Flexible
  • Rich query syntax
  • THEY HAVE JOINS AND AGGREGATION!!
• The relational DB is great at being general purpose: you build a nicely normalized structure, establish the logical relationships, and then you can build any query you need for your application.
• This has kept us happy in the data warehousing (and app-dev) world for decades!
7. Scale and Performance
Performance:
• Relational databases have a lot of features, and overhead, that in many cases we don't need.
Scale out:
• Most relational databases scale vertically, which limits how large they can get; federation and sharding are awkward, manual processes.
• Most NoSQL databases scale horizontally on commodity hardware.
8. But what will we sacrifice?
• Query features:
  • NoSQL DBs have fairly simple query languages, with limited or no support for the following outside of map reduce:
    • Joins
    • Aggregation
• Why? NoSQL databases were born to be high performance. Data is stored as it is to be used (tuned to a query) rather than modeled around entities, so a sophisticated query language is not needed.
• BI and ETL tool support is limited.
9. So what about NoSQL for Analytics?
• NoSQL databases are generally not as flexible as relational databases for ad-hoc questions.
• Secondary indexes provide some flexibility, but the lack of joins generally requires denormalization.
• Materialized views: joins and aggregates can be implemented via map reduce. However, materializing the world has its drawbacks!
• A different way of doing things:
  • Client-side join: query the "dimensions" to get a key, then query the "fact" on a secondary index.
  • Link walking: leverage metadata links between entities (Riak!)
  • Ad-hoc map reduce jobs
  • Aggregate navigation: navigate between aggregate entities of different grain
  • Search!
• Native BI tool support is limited!
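The client-side join pattern above can be illustrated with a minimal sketch. Plain Python dicts and lists stand in for the key-value "dimension" bucket and secondary-indexed "fact" bucket; all names and key shapes are illustrative, not from any particular platform:

```python
# Client-side join sketch: the application fetches the "dimension"
# record by key, queries the "fact" store on a secondary index,
# and stitches the results together itself -- the database never
# performs a join.

users = {  # "dimension" bucket, keyed by user id
    "u1": {"name": "Bobby", "state": "NJ"},
    "u2": {"name": "Susie", "state": "NY"},
}

orders = [  # "fact" bucket; user_id acts as a secondary index
    {"order_id": "o1", "user_id": "u1", "total": 25.00},
    {"order_id": "o2", "user_id": "u1", "total": 10.50},
    {"order_id": "o3", "user_id": "u2", "total": 99.99},
]

def orders_for_user(user_id):
    """Join performed in the client, not the database."""
    user = users[user_id]                                   # key lookup
    facts = [o for o in orders if o["user_id"] == user_id]  # 2i query
    return [{**o, "name": user["name"], "state": user["state"]}
            for o in facts]

print(orders_for_user("u1"))
```

The trade-off is visible even at this scale: two round trips and some client-side glue replace a one-line SQL join.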
10. NoSQL can be a great fit for analytic applications!
• High-volume, low-latency analytic environments
• Queries are largely known and can be precomputed in-stream (via the application itself or Storm) or in batch using map reduce
• The sweet spot is very high volumes with relatively static analytic requirements.
• Common design pattern:
  • Compute aggregates and events in-line and store them to aggregate entities in NoSQL
  • Write enriched detail records to NoSQL or Hadoop for further processing
[Chart: RDBMS vs. NoSQL positioned on query flexibility vs. volume]
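The common design pattern above can be sketched in a few lines. This is a toy stand-in for what Storm or the application itself would do: aggregates for a known query are computed in-line as events arrive and written to keyed aggregate entities, while the detail record is kept for batch processing. The stores are plain dicts and the key shape is an assumption for illustration:

```python
from collections import defaultdict

# In-stream aggregation sketch: the query ("revenue per channel
# per day") is known up front, so it is precomputed as events
# arrive rather than answered ad-hoc later.

aggregates = defaultdict(lambda: {"count": 0, "total": 0.0})
detail_store = []  # enriched detail, kept for batch work (NoSQL/Hadoop)

def handle_event(event):
    key = (event["channel"], event["date"])  # aggregate entity key
    agg = aggregates[key]
    agg["count"] += 1
    agg["total"] += event["amount"]
    detail_store.append(event)

for e in [{"channel": "Web", "date": "2013-09-01", "amount": 20.0},
          {"channel": "Web", "date": "2013-09-01", "amount": 5.0},
          {"channel": "In-Store", "date": "2013-09-01", "amount": 7.5}]:
    handle_event(e)

print(aggregates[("Web", "2013-09-01")])  # {'count': 2, 'total': 25.0}
```

Reading the aggregate is then a single key lookup, which is why the pattern suits high volumes with static requirements.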
11. BIG ETL
• One of the promises of BIG DATA is being able to enrich and process enormous data volumes
• The processing engines:
  • Storm: inline, real-time processing
  • Hadoop: batch processing
• NoSQL can play an integral part in this architecture as a:
  • Distributed lookup cache
  • Shared state
  • Queueing mechanism
[Diagram: Data Sources → Storm Topology → Relational EDW and Analytic Data-stores]
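The distributed-lookup-cache role is the simplest of the three to picture. In this sketch a plain dict stands in for the key-value store (in a real topology, something like Redis shared by every worker), and the record fields are illustrative:

```python
# Lookup-cache sketch for BIG ETL: each incoming record is
# enriched with reference data fetched by key -- a constant-time
# lookup instead of a relational join against the source system.

ref_cache = {"NJ": "New Jersey", "NY": "New York"}  # preloaded reference data

def enrich(record):
    record["state_name"] = ref_cache.get(record["state"], "UNKNOWN")
    return record

print(enrich({"user": "Bobby", "state": "NJ"}))
```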
12. NoSQL databases are like snowflakes or Smurfs...
They are all special and no two are alike!
Let's review the main categories and determine their general fit for analytic applications.
13. Key Value
• Platforms: Riak, Redis
• Buckets/collections are the equivalent of a table in an RDBMS
• The primary unit of storage is a key-value pair; the value can be anything from a number to a JSON document.
• Key-value stores are super fast and simple
• Analytic capabilities:
  • Although many have very spartan feature sets, some platforms like Riak have analytic-friendly links, tags, metadata, and powerful map reduce capabilities!
  • Writes and reads are generally ultra-fast: a good candidate for a BIG ETL component
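A minimal sketch of the key-value model described above, with a dict standing in for a bucket and JSON serialization for the opaque value (the `put`/`get` names are illustrative, not any platform's API):

```python
import json

# Key-value sketch: the bucket maps keys to opaque blobs. The
# store never inspects the value, which is why it can be anything
# from a counter to a full JSON document.

bucket = {}

def put(key, value):
    bucket[key] = json.dumps(value)   # serialize; stored as a blob

def get(key):
    return json.loads(bucket[key])    # deserialize on the way out

put("user:bobby", {"email": "bobby@db-lover.com", "channel": "Web"})
put("pageviews:2013-09-01", 1024)     # values need not be documents

print(get("user:bobby")["channel"])   # Web
```

The flip side of this simplicity is that all query "intelligence" lives in the key design: you can only ask for what you stored under a key you can construct.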
14. Columnar
• Platforms: Cassandra, HBase
• Column families are the equivalent of a table in an RDBMS
• The primary unit of storage is a column; columns are stored contiguously
• Skinny rows: most like a relational database, except columns are optional and not stored if omitted
• Wide rows: rows can be billions of columns wide; used for time series, relationships, analytics!
• Analytic capabilities: widely used in analytic applications. A typical analytic design pattern is to use "skinny rows" for detail records and "wide rows" for aggregates.
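The two row shapes above can be sketched with nested dicts standing in for a column family (all names here are illustrative; a real store keeps columns sorted and contiguous on disk):

```python
# Skinny row: looks relational, but absent columns simply are
# not stored -- Susie has no "state" column at all.
users_cf = {
    "bobby": {"email": "bobby@db-lover.com", "state": "NJ"},
    "susie": {"email": "susie@sql-enthusiast.com"},
}

# Wide row: one row per sensor, one column per timestamp -- the
# time-series pattern. Real rows can grow to billions of columns.
readings_cf = {
    "sensor-42": {
        "2013-09-01T00:00": 19.5,
        "2013-09-01T00:01": 19.7,
        "2013-09-01T00:02": 20.1,
    },
}

# Because columns in a row are stored sorted and contiguously, a
# time-range query is a cheap slice over one row's column names.
row = readings_cf["sensor-42"]
window = {t: v for t, v in row.items() if t <= "2013-09-01T00:01"}
print(window)  # {'2013-09-01T00:00': 19.5, '2013-09-01T00:01': 19.7}
```

This is why wide rows suit aggregates and time series: the row key picks the entity, and the column names form a naturally ordered index within it.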
15. Document
• Platforms: MongoDB, CouchDB
• Collections are the equivalent of a table in an RDBMS
• The primary unit of storage is a document:
{ "User": "Bobby",
  "Email": "bobby@db-lover.com",
  "Channel": "Web",
  "State": "NJ" }
{ "User": "Susie",
  "Email": "Susie@sql-enthusiast.com",
  "PreferredCategories": [
    { "Category": "Fashion",
      "CategoryAdded": "2012-01-01" },
    { "Category": "Outdoor Equipment",
      "CategoryAdded": "2013-01-01" } ],
  "Channel": "In-Store" }
• Analytic capabilities: most similar to relational in function but requires denormalization. Secondary index support, map reduce. Mongo has a cool new aggregation framework.
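A group-by over a document collection can be sketched in plain Python; the same operation in MongoDB's aggregation framework would be expressed as a pipeline like `[{"$group": {"_id": "$Channel", "users": {"$sum": 1}}}]`. The in-memory collection and field names follow the example documents above (the third document is an added illustration):

```python
from collections import defaultdict

# Document-store aggregation sketch: count users per channel.
# Note that documents are schema-flexible -- Susie has no "State"
# field, and the code must tolerate that.

collection = [
    {"User": "Bobby", "Channel": "Web", "State": "NJ"},
    {"User": "Susie", "Channel": "In-Store"},
    {"User": "Pat", "Channel": "Web", "State": "NY"},
]

counts = defaultdict(int)
for doc in collection:
    counts[doc["Channel"]] += 1   # group by Channel, count documents

print(dict(counts))  # {'Web': 2, 'In-Store': 1}
```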
16. In the Real World: High-Volume Sensor Analytics
• Ingestion and analytics of sensor data
• 6 to 12 BILLION records ingested daily (an average of 500k records per second at peak load)!
• Ingested data must be stored to disk and highly available
• Pre-defined aggregates and event monitors must be near real-time
• Ad-hoc query capabilities are required on historical data
17. One way to do it... that worked
[Diagram: Sensor Data → Kafka → Storm Cluster → Low-Latency Analytics (d3.js) and Hadoop Cluster; Atomic data, Aggregates, Event Monitors]
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs the atomic data and derived data needed for analytics
• Redis is used as a reference-data lookup cache
• Real-time analytics are produced from the aggregated data
• Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
18. Parting Thought
Polyglot Persistence: "where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it."
-- Martin Fowler