SnappyData
Getting Spark ready for real-time,
operational analytics
www.snappydata.io
Jags Ramnarayan
jramnarayan@snappydata.io
Co-founder SnappyData
Nov 2015
SnappyData - an EMC/Pivotal spin out
● New Spark-based open source project started by Pivotal
GemFire founders+engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an
OLTP+OLAP database
www.snappydata.io
Lambda Architecture (LA) for Analytics
Perspective on LA for real time
[Diagram: Streams → Transform (data-in-motion analytics) → In-Memory DB (interactive queries, updates) and a deep-scale, high-volume MPP DB → Analytics application and Alerts]
Use case: Telemetry
Revenue Generation
– Real-time location-based mobile advertising (B2B2C)
– Location-based services (B2C, B2B, B2B2C)
Revenue Protection
– Customer experience management to reduce churn
– Customer sentiment analysis
Network Efficiency
– Network bandwidth optimisation
– Network signalling maximisation
• Network optimization
– E.g. re-route a call to another cell tower if congestion is detected
• Location-based ads
– Match the incoming event to the subscriber profile; if 'Opt-in', show a location-sensitive ad
• Challenge: too much streaming data
– Many subscribers, lots of 2G/3G/4G voice/data
– Network events: location events, CDRs, network issues
Challenge - Keeping up with streams
• Millions of events/sec
• HA – Continuously Ingest
• Cannot throttle the stream
• Diverse formats
Challenge - Transform is expensive
• Filter, normalize, transform
• Need reference data to normalize – point lookups against a reference DB (enterprise Oracle, …)
Challenge - Stream joins, correlations
Analyze over a time window
● Simple rules – e.g. if (CallDroppedCount > threshold) then alert
● Or complex, OLAP-like queries
● TopK, trending, joins with reference data, correlation with history
How do you keep up with OLAP-style analytics with millions of events in the window and billions of records in reference data?
Challenge - State management
Manage generated state
● Mutating state: millions of counters
● “Once and only once”
● Consistency across distributed system
● State HA
Challenge - Interactive Query speed
Interactive queries
- OLAP style queries
- High concurrency
- Low response time
Today: queue -> process -> NoSQL
Messaging cluster adds extra hops and management overhead
No distributed, HA data store
Streaming joins, or joins with external state, are slow and not scalable in many cases
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream
for real-time analytics
Batch design, high throughput vs. a real-time design center – low latency, HA, concurrent
Vision: drastically reduce the cost and complexity in modern big data
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream
for real-time analytics
Real-time operational analytics – TBs in memory
[Architecture diagram: RDB rows with transactions, columnar storage, stream processing, AQP engine and indexes; APIs: ODBC, JDBC, REST and Spark (Scala, Java, Python, R); integrates with HDFS and MPP DBs]
First commercial project on Approximate Query Processing (AQP)
Why columnar storage?
Why Spark?
● Blends streaming, interactive, and batch analytics
● Appeals to Java, R, Python, Scala folks
● Succinct programs
● Rich set of transformations and libraries
● RDD and fault tolerance without replication
● Stream processing with high throughput
Spark Myths
● It is a distributed in-memory database
○ It’s a computational framework with immutable caching
● It is Highly Available
○ Fault tolerance is not the same as HA
● Well suited for real time, operational environments
○ Does not handle concurrency well
Common Spark Streaming Architecture
[Diagram: Kafka queue feeding two Spark executors, each holding RDD partitions @t0, @t1, @t2 over time; driver; Cassandra as the external store; client submits the stream app]
The queue is buffered in the executor. The driver submits a batch job every second, which pushes a new RDD (one batch from the buffer) into the stream.
Short-term state is immutable in the executors; long-term state lives in an external DB.
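To make the micro-batch flow concrete, here is a minimal sketch of this pattern using the Spark 1.x Streaming and Kafka receiver APIs. The ZooKeeper address, consumer group, and topic name ("zk:2181", "events-group", "events") are placeholders, and the write to the external store is left as a stub.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CommonStreamingArchitecture")
    // The driver submits a batch job every second; each batch becomes a new RDD.
    val ssc = new StreamingContext(conf, Seconds(1))

    // The receiver buffers the Kafka queue in executor memory.
    // "zk:2181", "events-group" and "events" are placeholder names.
    val lines = KafkaUtils
      .createStream(ssc, "zk:2181", "events-group", Map("events" -> 1))
      .map(_._2)

    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // Short-term state stays in immutable RDDs; long-term state must be
    // pushed to an external database (Cassandra, etc.) by the application.
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach { case (word, count) =>
          println(s"$word -> $count") // or write to the external store (connector not shown)
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```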
Challenge: Spark driver not HA
[Diagram: client submits the stream app to the driver, which manages the executors]
If the driver fails, the executors automatically exit – all cached state has to be re-hydrated.
Challenge: Sharing state
[Diagram: Client1 and Client2 each have their own driver and executors]
• Spark is designed for total isolation across client apps
• Sharing state across clients requires an external DB/Tachyon
Challenge: External state management
[Diagram: Kafka queue feeding a Spark executor holding RDD partitions @t0, @t1, @t2 over time; driver; Cassandra as the external store; client submits the stream app]
Key-based access might keep up, but joins and analytic operators are a problem: serialization and copying costs are too high, especially in JVMs.
newDStream = wordDstream.updateStateByKey[Int](func)
– Spark's capability to update state as batches arrive requires a full iteration over the RDD
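A runnable completion of the line above, as a minimal sketch using standard Spark Streaming APIs. It also shows why the update is expensive: Spark materializes a full state RDD per batch and invokes the update function for every key it has ever seen, not just the keys that arrived in the current batch.

```scala
import org.apache.spark.streaming.dstream.DStream

// wordDstream: DStream[(String, Int)] of (word, 1) pairs from the stream.
def updateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newValues.sum)

// NOTE: updateStateByKey requires ssc.checkpoint(...) to be configured.
// Every batch iterates over the entire state RDD, even for idle keys.
def totals(wordDstream: DStream[(String, Int)]): DStream[(String, Int)] =
  wordDstream.updateStateByKey[Int](updateFunc _)
```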
Challenge: “Once and only once” = hard
[Diagram: two executors incrementing a counter X in Cassandra with X = X+10 per batch; after a partition is recovered and its batch replayed, the increment is applied again, so X advances 10 → 20 → 30, double-counting the update]
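A minimal sketch of the failure mode pictured above (the map-based "store" and the names are hypothetical stand-ins for the external database): a non-idempotent increment that is replayed after partition recovery double-counts, whereas an idempotent write keyed by batch time does not.

```scala
// Hypothetical in-process store, standing in for Cassandra in the diagram.
val store = scala.collection.mutable.Map("X" -> 10)

// Non-idempotent: replaying the same batch adds 10 twice (10 -> 20 -> 30).
def applyIncrement(key: String, delta: Int): Unit =
  store(key) = store(key) + delta

// Idempotent alternative: the batch id makes a replay overwrite the same
// cell, so re-running a recovered partition leaves the result unchanged.
val versioned = scala.collection.mutable.Map[(String, Long), Int]()
def applyBatchResult(key: String, batchTime: Long, value: Int): Unit =
  versioned((key, batchTime)) = value
```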
Challenge: Always on
[Diagram: Kafka queue feeding two Spark executors, each holding RDD partitions @t0, @t1, @t2 over time; driver; client submits the stream app]
HA requirement: if something fails, there is always a redundant copy that is fully in sync, and failover is instantaneous.
Fault tolerance in Spark: recover state from the original source or a checkpoint by tracking lineage. This can take too long.
Challenge: Concurrent queries too slow
SELECT
SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark
-- AWS m2.4xlarge ; total of 342 GB
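One way to observe this behaviour (a sketch, not the harness used for the benchmark numbers above): fire the same aggregation concurrently from several threads against a SQLContext and watch response time degrade as concurrency grows. The table and column names follow the AMPLab benchmark schema; the prefix length of 8 is an arbitrary choice for X.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SQLContext

def measureConcurrency(sqlContext: SQLContext, concurrency: Int): Long = {
  implicit val ec = ExecutionContext.fromExecutor(
    Executors.newFixedThreadPool(concurrency))
  val query =
    """SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue)
      |FROM uservisits
      |GROUP BY SUBSTR(sourceIP, 1, 8)""".stripMargin

  val start = System.currentTimeMillis()
  // Submit the same query from `concurrency` threads at once.
  val runs = (1 to concurrency).map(_ => Future { sqlContext.sql(query).collect() })
  Await.result(Future.sequence(runs), 30.minutes)
  System.currentTimeMillis() - start // total wall-clock time in ms
}
```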
SnappyData: P2P cluster w/ consensus
[Diagram: three data servers (Data Server JVM1, JVM2, JVM3) forming a peer-to-peer cluster]
● Cluster elects a coordinator
● Consistent views across
members
● Virtual synchrony across
members
● Why? Strong consistency during replication; accurate and fast failure detection
Colocated row/column Tables in Spark
[Diagram: three Spark executors, each hosting colocated row and column tables, stream processing, and tasks on top of the Spark block manager]
● Spark executors are long-lived and shared across multiple apps
● Gem memory manager and Spark block manager are integrated
Table can be partitioned or replicated
[Diagram: replicated tables keep a consistent replica on each node, while partitioned tables split rows into buckets (A-H, I-P, Q-W) spread across nodes with one or more partition replicas]
Linearly scale with shared partitions
[Diagram: a Kafka queue partitioned by subscriber range (A-M, N-Z) feeds Spark executors that hold the matching subscriber partitions along with reference data]
Linearly scale with partition pruning: the input queue, stream, in-memory DB, and output queue all share the same partitioning strategy.
Point access, updates, fast writes
● Row tables with PKs are distributed HashMaps
○ with secondary indexes
● Support for transactional semantics
○ read_committed, repeatable_read
● Support for scalable high write rates
○ streaming data goes through stages
○ queue streams, intermediate storage (Delta row buffer),
immutable compressed columns
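Since the row store is reachable over standard JDBC (per the ODBC/JDBC/REST APIs listed earlier), a transactional point update might look like the sketch below. The driver URL, table, and column names are placeholders, not the actual SnappyData values; only standard java.sql calls are used.

```scala
import java.sql.{Connection, DriverManager}

// Placeholder JDBC URL; substitute the real driver/URL for the cluster.
val conn: Connection = DriverManager.getConnection("jdbc:snappydata://localhost:1527/")
try {
  conn.setAutoCommit(false)
  // read_committed isolation, as listed on the slide
  conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED)

  // Point update against a hypothetical row table keyed by primary key.
  val stmt = conn.prepareStatement(
    "UPDATE subscribers SET plan = ? WHERE subscriber_id = ?")
  stmt.setString(1, "4G_UNLIMITED")
  stmt.setLong(2, 42L)
  stmt.executeUpdate()
  conn.commit()
} finally {
  conn.close()
}
```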
Full Spark Compatibility
● Any table is also visible as a DataFrame
● Any RDD[T]/DataFrame can be stored in SnappyData tables
● Tables appear like any JDBC sourced table
○ But, in executor memory by default
● Additional API for updates, inserts, deletes
//Save a dataFrame using the spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props);
//save using DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1");
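As a follow-on to the snippet above, reading the table back uses ordinary Spark SQL calls. This is a sketch that assumes the context behaves like a Spark SQLContext (table, sql), not a definitive SnappyData API reference.

```scala
// The saved table is visible as a DataFrame, so plain Spark SQL applies.
val t1 = context.table("T1")                    // standard SQLContext.table
println(t1.count())                             // ordinary DataFrame action
context.sql("SELECT COUNT(*) FROM T1").show()   // or query it with SQL
```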
Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
  <column definition>
) USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',               // default: none
  PARTITION_BY 'PRIMARY KEY | column name', // if omitted, the table is replicated by default
  REDUNDANCY '1',                           // manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
                                            // empty string maps to the default disk store
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  …
)
[AS select_statement];
Key feature: Synopses Data
● Maintain stratified samples
○ Intelligent sampling to keep error bounds low
● Probabilistic data
○ TopK for time series (using time aggregation CMS, item aggregation)
○ Histograms, HyperLogLog, Bloom Filters, Wavelets
CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
  BASETABLE 'table_name'             // source column table or stream table
  [ SAMPLINGMETHOD "stratified | uniform" ]
  STRATA name (
    QCS ("comma-separated-column-names")
    [ FRACTION "frac" ]
  ),+                                // one or more QCS
)
Stratified Sampling Spark Demo
Driver HA, JobServer for interactive jobs
● REST based JobServer for sharing a single Context across clients
○ clients use REST to execute streaming jobs, queries, DML
○ secondary JobServer for HA
○ primary election using Gem clustering
● Native SnappyData cluster manager for long-running executors
○ makes resources (executors) long-running
○ reuses the same executors across apps and jobs
● Low latency scheduling that skips the Spark driver altogether
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: TB problem becomes GB.
○ CPU contention drops
● Far less complex
○ single cluster for stream ingestion, continuous queries, interactive
queries and machine learning
● Much faster
○ compressed data managed in distributed memory in columnar form
reduces volume and is much more responsive
SnappyData is Open Source
● Beta will be on github before December. We are looking for
contributors!
● Learn more & register for beta: www.snappydata.io
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ linkedin: www.linkedin.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com
○ IRC: irc.freenode.net #snappydata
Extras
OLAP/OLTP with Synopses
[Diagram: user applications process events and issue interactive queries via CQ subscriptions and an OLAP query engine; a micro-batch processing module (plugins) over a sliding window emits batches into two stores:
▪ Summary DB – time series with decay; TopK and frequency summary structures; counters; histograms; stratified samples; raw data windows
▪ Exact DB – row + column oriented]
Not a panacea, but comes close
● Synopses require prior workload knowledge
● Not all queries … complex queries will result in high error rates
○ single cluster for stream ingestion and analytics queries (both
streaming and interactive)
● Our strategy - be adjunct to MPP databases...
○ first compute the error estimate; if error is above tolerance
delegate to exact store
Adjunct store in certain scenarios
[Chart: speed vs. accuracy tradeoff – error plotted against execution time (sample size); executing on the entire dataset takes ~30 mins, while interactive queries on samples return in ~2 sec]
Stratified Sampling
● Random sampling has intuitive semantics
● However, data is typically skewed and our queries are multi-dimensional
○ avg sales order price for each product class for each geography
○ some products may have little to no sales
○ stratification ensures that each “group” (product class) is represented
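For intuition, Spark's own RDD API already supports per-stratum sampling on batch data via sampleByKey; the minimal sketch below uses product class as the strata key with hand-picked, purely illustrative fractions. The streaming challenges discussed next arise because this has to be approximated over an unbounded input.

```scala
import org.apache.spark.rdd.RDD

// orders: RDD of (productClass, orderPrice)
def stratifiedAvgPrice(orders: RDD[(String, Double)]): Map[String, Double] = {
  // Oversample rare product classes so every group is represented.
  // sampleByKey expects a sampling rate for every key present in the RDD;
  // the keys and fractions here are illustrative, not a tuned policy.
  val fractions = Map("popular" -> 0.01, "niche" -> 0.5, "rare" -> 1.0)
  val sample = orders.sampleByKey(withReplacement = false, fractions, seed = 42L)

  sample
    .mapValues(price => (price, 1L))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, count) => sum / count }
    .collect()
    .toMap
}
```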
Stratified Sampling Challenges
● Solutions exist for batch data (BlinkDB)
● Needs to work for infinite streams of data
○ Answer: use a combination of stratified sampling with other techniques like Bernoulli/reservoir sampling
○ Exponentially decay over time
Dealing with errors and latency
● Well-known error estimation techniques exist for "closed form" aggregations
● Exploring other techniques -- the Analytical Bootstrap
● Users can specify an error bound with a confidence interval:
SELECT avg(sessionTime) FROM Table
WHERE city = 'San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
● Engine would determine if it can satisfy error bound first
● If not, delegate execution to an “exact” store (GPDB, etc)
● Query execution can also be latency bounded
Sketching techniques
● Sampling not effective for outlier detection
○ MAX/MIN etc
● Other probabilistic structures like CMS, heavy hitters, etc
● We implemented Hokusai
○ capture frequencies of items in time series
● Design permits TopK queries over arbitrary time intervals
(Top-100 popular URLs)
SELECT pageURL, count(*) frequency FROM Table
WHERE …. GROUP BY ….
ORDER BY frequency DESC
LIMIT 100
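The TopK query above is answered from probabilistic frequency structures such as CMS. Below is a minimal, self-contained Count-Min Sketch in Scala to show the idea; this is a generic textbook sketch, not SnappyData's Hokusai implementation, and the width/depth values are arbitrary.

```scala
import scala.util.hashing.MurmurHash3

// Minimal Count-Min Sketch: estimates item frequencies with bounded
// overestimation, using `depth` hash rows of `width` counters each.
class CountMinSketch(width: Int = 2048, depth: Int = 5) {
  private val counts = Array.ofDim[Long](depth, width)

  // Each row uses a different hash seed; mask keeps the index non-negative.
  private def bucket(item: String, row: Int): Int =
    (MurmurHash3.stringHash(item, row) & Int.MaxValue) % width

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) counts(row)(bucket(item, row)) += count

  // Estimate never underestimates; it may overestimate on hash collisions.
  def estimate(item: String): Long =
    (0 until depth).map(row => counts(row)(bucket(item, row))).min
}

// Usage: feed page URLs from a batch, then query candidate top URLs.
val cms = new CountMinSketch()
Seq("url/a", "url/b", "url/a").foreach(cms.add(_))
assert(cms.estimate("url/a") >= 2)
```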
Demo
[Diagram: Zeppelin server with a Spark interpreter (driver) connected to three Spark executor JVMs, each holding a row cache and compressed columnar data]
A new approach to Real Time Analytics
[Diagram: streaming analytics, probabilistic data, and distributed in-memory SQL built on deep integration of Spark + Gem; a unified, always-on, cloud-ready cluster for real-time analytics that integrates with deep-scale, high-volume MPP DBs]
Vision – drastically reduce the cost and complexity in modern big data, using a fraction of the resources: 10X better response time, 10X lower resource cost, 10X less complexity.

Mais procurados

SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentationpunesparkmeetup
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...DataWorks Summit
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenDatabricks
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 

Mais procurados (20)

SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentation
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Is hadoop for you
Is hadoop for youIs hadoop for you
Is hadoop for you
 
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 

Destaque

Part 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gPart 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gMark Rittman
 
A Journey to Modern Apps with Containers, Microservices and Big Data
A Journey to Modern Apps with Containers, Microservices and Big DataA Journey to Modern Apps with Containers, Microservices and Big Data
A Journey to Modern Apps with Containers, Microservices and Big DataEdward Hsu
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Mark Rittman
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Robbie Strickland
 
Hi Speed Datawarehousing
Hi Speed DatawarehousingHi Speed Datawarehousing
Hi Speed DatawarehousingJos van Dongen
 
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®Transforming Data Management and Time to Insight with Anzo Smart Data Lake®
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®Cambridge Semantics
 
Data Scientist 101 BI Dutch
Data Scientist 101 BI DutchData Scientist 101 BI Dutch
Data Scientist 101 BI DutchJos van Dongen
 
Introduction to Anzo Unstructured
Introduction to Anzo UnstructuredIntroduction to Anzo Unstructured
Introduction to Anzo UnstructuredCambridge Semantics
 
Database Shootout: What's best for BI?
Database Shootout: What's best for BI?Database Shootout: What's best for BI?
Database Shootout: What's best for BI?Jos van Dongen
 
Graph-based Discovery and Analytics at Enterprise Scale
Graph-based Discovery and Analytics at Enterprise ScaleGraph-based Discovery and Analytics at Enterprise Scale
Graph-based Discovery and Analytics at Enterprise ScaleCambridge Semantics
 
Always On: Building Highly Available Applications on Cassandra
Always On: Building Highly Available Applications on CassandraAlways On: Building Highly Available Applications on Cassandra
Always On: Building Highly Available Applications on CassandraRobbie Strickland
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Scalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and MesosScalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and MesosDataWorks Summit
 
Online Analytics with Hadoop and Cassandra
Online Analytics with Hadoop and CassandraOnline Analytics with Hadoop and Cassandra
Online Analytics with Hadoop and CassandraRobbie Strickland
 
How to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsHow to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsCambridge Semantics
 
Semantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesSemantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesCambridge Semantics
 
Applying Data Engineering and Semantic Standards to Tame the "Perfect Storm" ...
Applying Data Engineering and Semantic Standards to Tame the "Perfect Storm" ...Applying Data Engineering and Semantic Standards to Tame the "Perfect Storm" ...
Applying Data Engineering and Semantic Standards to Tame the "Perfect Storm" ...Cambridge Semantics
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleHelena Edelson
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 

Destaque (20)

GemFire In-Memory Data Grid
GemFire In-Memory Data GridGemFire In-Memory Data Grid
GemFire In-Memory Data Grid
 
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gPart 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
 
A Journey to Modern Apps with Containers, Microservices and Big Data
A Journey to Modern Apps with Containers, Microservices and Big DataA Journey to Modern Apps with Containers, Microservices and Big Data
A Journey to Modern Apps with Containers, Microservices and Big Data
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
Hi Speed Datawarehousing
Hi Speed DatawarehousingHi Speed Datawarehousing
Hi Speed Datawarehousing
 
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®Transforming Data Management and Time to Insight with Anzo Smart Data Lake®
Transforming Data Management and Time to Insight with Anzo Smart Data Lake®
 
Data Scientist 101 BI Dutch
Data Scientist 101 BI DutchData Scientist 101 BI Dutch
Data Scientist 101 BI Dutch
 
Introduction to Anzo Unstructured
Introduction to Anzo UnstructuredIntroduction to Anzo Unstructured
Introduction to Anzo Unstructured
 
Database Shootout: What's best for BI?
Database Shootout: What's best for BI?Database Shootout: What's best for BI?
Database Shootout: What's best for BI?
 
Graph-based Discovery and Analytics at Enterprise Scale
Graph-based Discovery and Analytics at Enterprise ScaleGraph-based Discovery and Analytics at Enterprise Scale
Graph-based Discovery and Analytics at Enterprise Scale
 
Always On: Building Highly Available Applications on Cassandra
Always On: Building Highly Available Applications on CassandraAlways On: Building Highly Available Applications on Cassandra
Always On: Building Highly Available Applications on Cassandra
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Scalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and MesosScalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and Mesos
 
Online Analytics with Hadoop and Cassandra
Online Analytics with Hadoop and CassandraOnline Analytics with Hadoop and Cassandra
Online Analytics with Hadoop and Cassandra
 
How to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsHow to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using Semantics
 
Semantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesSemantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational Databases
 
Applying Data Engineering and Semantic Standards to Tame the "Perfect Storm" ...
Applying Data Engineering and Semantic Standards to Tame the "Perfect Storm" ...Applying Data Engineering and Semantic Standards to Tame the "Perfect Storm" ...
Applying Data Engineering and Semantic Standards to Tame the "Perfect Storm" ...
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 

Semelhante a SnappyData overview NikeTechTalk 11/19/15

Scala in increasingly demanding environments - DATABIZ
Scala in increasingly demanding environments - DATABIZScala in increasingly demanding environments - DATABIZ
Scala in increasingly demanding environments - DATABIZDATABIZit
 
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...Scala Italy
 
Efficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out DatabasesEfficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out DatabasesSnappyData
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataVMware Tanzu
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataCarlos Andrés García
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriChetan Khatri
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexApache Apex
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase DataWorks Summit
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analyticskgshukla
 

Semelhante a SnappyData overview NikeTechTalk 11/19/15 (20)

Scala in increasingly demanding environments - DATABIZ
Scala in increasingly demanding environments - DATABIZScala in increasingly demanding environments - DATABIZ
Scala in increasingly demanding environments - DATABIZ
 
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
Stefano Rocco, Roberto Bentivoglio - Scala in increasingly demanding environm...
 
Efficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out DatabasesEfficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out Databases
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
Low Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache ApexLow Latency Polyglot Model Scoring using Apache Apex
Low Latency Polyglot Model Scoring using Apache Apex
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 

Último

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Último (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

SnappyData overview NikeTechTalk 11/19/15

  • 1. SnappyData Getting Spark ready for real-time, operational analytics www.snappydata.io Jags Ramnarayan jramnarayan@snappydata.io Co-founder SnappyData Nov 2015
  • 2. SnappyData - an EMC/Pivotal spin out ● New Spark-based open source project started by Pivotal GemFire founders+engineers ● Decades of in-memory data management experience ● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database www.snappydata.io
  • 3. Lambda Architecture (LA) for Analytics
  • 4. Perspective on LA for real time In-Memory DB Interactive queries, updates Deep Scale, High volume MPP DB Transform Data-in-motion Analytics Application Streams Alerts
  • 5. Use case: Telemetry Revenue Generation Real-time Location based Mobile Advertising (B2B2C) Location Based Services (B2C, B2B, B2B2C) Revenue Protection Customer experience management to reduce churn Customers Sentiment analysis Network Efficiency Network bandwidth optimisation Network signalling maximisation • Network optimization – E.g. re-reroute call to another cell tower if congestion detected • Location based Ads – Match incoming event to Subscriber profile; If ‘Opt-in’ show location sensitive Ad • Challenge: Too much streaming data – Many subscribers, lots of 2G/3G/4G voice/data – Network events: location events, CDRs, network issues
  • 6. Challenge - Keeping up with streams In-Memory DB Interactive queries, updates Deep Scale, High volume MPP DB Transform Data-in-motion Analytics Application Streams Alerts • Millions of events/sec • HA – Continuously Ingest • Cannot throttle the stream • Diverse formats
  • 7. Challenge - Transform is expensive In-Memory DB Interactive queries, updates Deep Scale, High volume MPP DB Transform Data-in-motion Analytics Application Streams Alerts • Filter, Normalize, transform • Need reference data to normalize – point lookups Reference DB (Enterprise Oracle, …)
  • 8. Challenge - Stream joins, correlations In-Memory DB Interactive queries, updates Deep Scale, High volume MPP DB Transform Data-in-motion Analytics Application Streams Alerts Analyze over time window ● Simple rules - (CallDroppedCount > threshold) then alert ● Or, Complex (OLAP like query) ● TopK, Trending, Join with reference data, correlate with history How do you keep up with OLAP style analytics with millions of events in window and billions of records in ref data?
  • 9. Challenge - State management In-Memory DB Interactive queries, updates Deep Scale, High volume MPP DB Transform Data-in-motion Analytics Application Streams Alerts Manage generated state ● Mutating state: millions of counters ● “Once and only once” ● Consistency across distributed system ● State HA
  • 10. Challenge - Interactive Query speed In-Memory DB Interactive queries, updates Deep Scale, High volume MPP DB Transform Data-in-motion Analytics Application Streams Alerts Interactive queries - OLAP style queries - High concurrency - Low response time
  • 11. Today: queue -> process -> NoSQL Messaging cluster adds extra hops, management No distributed, HA Data store Streaming joins, or with external state is slow and not scalable in many cases
  • 12. SnappyData: A new approach Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real-time design center - Low latency, HA, concurrent Vision: Drastically reduce the cost and complexity in modern big data
  • 13. SnappyData: A new approach Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real time operational Analytics – TBs in memory RDB Rows Txn Columnar API Stream processing ODBC, JDBC, REST Spark - Scala, Java, Python, R HDFS AQP First commercial project on Approximate Query Processing(AQP) MPP DB Index
  • 15. Why Spark? ● Blends streaming, interactive, and batch analytics ● Appeals to Java, R, Python, Scala folks ● Succinct programs ● Rich set of transformations and libraries ● RDD and fault tolerance without replication ● Stream processing with high throughput
  • 16. Spark Myths ● It is a distributed in-memory database ○ It’s a computational framework with immutable caching ● It is Highly Available ○ Fault tolerance is not the same as HA ● Well suited for real time, operational environments ○ Does not handle concurrency well
  • 17. Common Spark Streaming Architecture Driver Executor – spark engine RDD Partition @t0 RDD Partition @t2 RDD Partition @t1 time Executor – spark engine RDD Partition @t0 RDD Partition @t2 RDD Partition @t1 time cassandra Kafka queue Client submits stream App Queue is buffered in executor. Driver submits batch job every second. This results in a new RDD pushed to stream(batch from buffer) Short term immutable state. Long term – In external DB
  • 18. Challenge: Spark driver not HA Driver Executor – spark engine Executor – spark engine Client submits stream App If Driver fails – Executors automatically exit All CACHED STATE HAS TO BE RE_HYDRATED
  • 19. Challenge: Sharing state DriverClient1 Executor • Spark designed for total isolation across client apps • Sharing state across clients requires external DB/Tachyon Executor DriverClient2 Executor Executor
  • 20. Challenge: External state management Driver Executor – spark engine RDD Partition @t0 RDD Partition @t2 RDD Partition @t1 time time cassandra Kafka queue Client submits stream App Key based access might keep up But, Joins, analytic operators is a problem. Serialization, copying costs are too high, esp in JVMs newDStream = wordDstream.updateStateByKey[Int] (func) - Spark capability to update state as batches arrive requires full iteration over RDD
  • 21. Challenge: “Once and only once” = hard Executor Executor Recovered partition cassandra X = 10 X = 20 X = 30 X = X+10 X = X+10 OK
  • 22. Challenge: Always on Driver Executor – spark engine RDD Partition @t0 RDD Partition @t2 RDD Partition @t1 time Executor – spark engine RDD Partition @t0 RDD Partition @t2 RDD Partition @t1 time Kafka queue Client submits stream App HA requirement : If something fails, there is always a redundant copy that is fully in sync. Failover is instantaneous Fault tolerance in Spark: Recover state from the original source or checkpoint by tracking lineage. Can take too long.
  • 23. Challenge: Concurrent queries too slow SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X) Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge ; total of 342 GB
  • 24. SnappyData: P2P cluster w/ consensus Data Server JVM1 Data Server JVM2 Data Server JVM3 ● Cluster elects a coordinator ● Consistent views across members ● Virtual synchrony across members ● WHY? Strong consistency during replication, failure detection is accurate and fast
  • 25. Colocated row/column Tables in Spark Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing ● Spark Executors are long lived and shared across multiple apps ● Gem Memory Mgr and Spark Block Mgr integrated
  • 26. Table can be partitioned or replicated Replicated Table Partitioned Table (Buckets A-H) Replicated Table Partitioned Table (Buckets I-P) consistent replica on each node Partition Replica (Buckets A-H) Replicated Table Partitioned Table (Buckets Q-W)Partition Replica (Buckets I-P) Data partitioned with one or more replicas
  • 27. Linearly scale with shared partitions Spark Executor Spark Executor Kafka queue Subscriber N-Z Subscriber A-M Subscriber A-M Ref data Linearly scale with partition pruning Input queue, Stream, IMDB, Output queue all share the same partitioning strategy
  • 28. Point access, updates, fast writes ● Row tables with PKs are distributed HashMaps ○ with secondary indexes ● Support for transactional semantics ○ read_committed, repeatable_read ● Support for scalable high write rates ○ streaming data goes through stages ○ queue streams, intermediate storage (Delta row buffer), immutable compressed columns
  • 29. Full Spark Compatibility ● Any table is also visible as a DataFrame ● Any RDD[T]/DataFrame can be stored in SnappyData tables ● Tables appear like any JDBC-sourced table ○ But, in executor memory by default ● Additional API for updates, inserts, deletes //Save a dataFrame using the spark context … context.createExternalTable("T1", "ROW", myDataFrame.schema, props); //save using DataFrame API dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1");
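Building on the snippet on the slide, a hedged round-trip sketch: create the table, append a DataFrame through the standard writer API, and read it back as a DataFrame. The table name T1, the props key and the id column are assumptions, not taken from the deck.

```scala
import org.apache.spark.sql.SaveMode

val props = Map("PARTITION_BY" -> "id")               // assumed option key/value
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Append rows through the ordinary DataFrame writer API shown on the slide.
myDataFrame.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

// Because any table is also a DataFrame, it can be queried straight back.
val t1 = context.table("T1")
t1.filter(t1("id") > 100).show()
```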
  • 30. Extends Spark CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name ( <column definition> ) USING 'JDBC | ROW | COLUMN' OPTIONS ( COLOCATE_WITH 'table_name', // default: none PARTITION_BY 'PRIMARY KEY | column name', // replicated table by default REDUNDANCY '1', // manage HA PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS", // empty string maps to the default disk store OFFHEAP "true | false", EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT", ….. ) [AS select_statement];
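As a hypothetical instance of the grammar above (table and column names and the specific option values are illustrative assumptions, not taken from the deck):

```sql
-- A partitioned, redundant row table colocated with an assumed reference table.
CREATE TABLE subscriber_events (
  subscriber_id BIGINT,
  event_time    TIMESTAMP,
  cell_tower    VARCHAR(32),
  call_dropped  BOOLEAN
) USING ROW OPTIONS (
  PARTITION_BY  'subscriber_id',
  COLOCATE_WITH 'subscriber_profile',
  REDUNDANCY    '1'
);
```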
  • 31. Key feature: Synopses Data ● Maintain stratified samples ○ Intelligent sampling to keep error bounds low ● Probabilistic data ○ TopK for time series (using time aggregation CMS, item aggregation) ○ Histograms, HyperLogLog, Bloom Filters, Wavelets CREATE SAMPLE TABLE sample-table-name USING columnar OPTIONS ( BASETABLE 'table_name' // source column table or stream table [ SAMPLINGMETHOD "stratified | uniform" ] STRATA name ( QCS ("comma-separated-column-names") [ FRACTION "frac" ] ),+ // one or more QCS
  • 32. Stratified Sampling Spark Demo www.snappydata.io
  • 33. Driver HA, JobServer for interactive jobs ● REST based JobServer for sharing a single Context across clients ○ clients use REST to execute streaming jobs, queries, DML ○ secondary JobServer for HA ○ primary election using Gem clustering ● Native SnappyData cluster manager for long running executors ○ makes resources (executors) long running ○ reuses the same executors across apps and jobs ● Low latency scheduling that skips the Spark driver altogether
  • 35. Unified OLAP/OLTP streaming w/ Spark ● Far fewer resources: TB problem becomes GB. ○ CPU contention drops ● Far less complex ○ single cluster for stream ingestion, continuous queries, interactive queries and machine learning ● Much faster ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  • 36. www.snappydata.io SnappyData is Open Source ● Beta will be on github before December. We are looking for contributors! ● Learn more & register for beta: www.snappydata.io ● Connect: ○ twitter: www.twitter.com/snappydata ○ facebook: www.facebook.com/snappydata ○ linkedin: www.linkedin.com/snappydata ○ slack: http://snappydata-slackin.herokuapp.com ○ IRC: irc.freenode.net #snappydata
  • 38. OLAP/OLTP with Synopses CQ Subscriptions OLAP Query Engine Micro Batch Processing Module (Plugins) Sliding Window Emits Batches [ ] User Applications processing Events & Issuing Interactive Queries Summary DB ▪ Time Series with decay ▪ TopK, Frequency Summary Structures ▪ Counters ▪ Histograms ▪ Stratified Samples ▪ Raw Data Windows Exact DB (Row + column oriented)
  • 39. Not a panacea, but comes close ● Synopses require prior workload knowledge ● Not all queries … complex queries will result in high error rates ○ single cluster for stream ingestion and analytics queries (both streaming and interactive) ● Our strategy: be adjunct to MPP databases ○ first compute the error estimate; if the error is above tolerance, delegate to the exact store
  • 40. Adjunct store in certain scenarios
  • 41. Speed/Accuracy tradeoff (chart: error vs. execution time as a function of sample size; executing on the entire dataset takes ~30 mins, interactive queries over a sample return in ~2 sec)
  • 42. Stratified Sampling ● Random sampling has intuitive semantics ● However, data is typically skewed and our queries are multi-dimensional ○ avg sales order price for each product class for each geography ○ some products may have little to no sales ○ stratification ensures that each “group” (product class) is represented
  • 43. Stratified Sampling Challenges ● Solutions exist for batch data (BlinkDB) ● Needs to work for infinite streams of data ○ Answer: use a combination of stratified sampling with other techniques like Bernoulli/reservoir sampling ○ Exponentially decay samples over time
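As a concrete flavor of the "other techniques" mentioned above, here is a minimal reservoir sampler (Algorithm R) in plain Scala; a stratified variant would simply keep one reservoir per stratum, and a decayed variant would periodically down-weight old entries. This is an illustrative sketch, not the SnappyData implementation.

```scala
import scala.util.Random

// Keeps a uniform random sample of size k over an unbounded stream of items.
class Reservoir[T](k: Int, rnd: Random = new Random()) {
  private val sample = new scala.collection.mutable.ArrayBuffer[T](k)
  private var seen = 0L

  def add(item: T): Unit = {
    seen += 1
    if (sample.size < k) sample += item
    else {
      val j = (rnd.nextDouble() * seen).toLong   // approximately uniform index in [0, seen)
      if (j < k) sample(j.toInt) = item          // keep the new item with probability k/seen
    }
  }

  def snapshot: Seq[T] = sample.toList
}
```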
  • 44. Dealing with errors and latency ● Well-known error techniques for “closed form aggregations” ● Exploring other techniques -- Analytical Bootstrap ● User can specify an error bound with a confidence interval SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0% ● The engine first determines whether it can satisfy the error bound ● If not, it delegates execution to an “exact” store (GPDB, etc.) ● Query execution can also be latency bounded
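For the "closed form" case above, the error bound for an average can be estimated from the sample itself via the normal approximation; a minimal sketch is below (the 1.96 factor corresponds to 95% confidence, and stratification weights and finite-population corrections are deliberately ignored). If halfWidth/mean exceeds the requested ERROR bound, the engine would delegate to the exact store, as the slide describes.

```scala
// 95% confidence interval for a sampled average: answer = mean ± halfWidth.
def meanWithError(sample: Seq[Double]): (Double, Double) = {
  require(sample.size > 1, "need at least two sampled values")
  val n = sample.size
  val mean = sample.sum / n
  val variance = sample.map(v => (v - mean) * (v - mean)).sum / (n - 1)
  val halfWidth = 1.96 * math.sqrt(variance / n)
  (mean, halfWidth)
}
```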
  • 45. Sketching techniques ● Sampling is not effective for outlier detection ○ MAX/MIN etc ● Other probabilistic structures like CMS, heavy hitters, etc ● We implemented Hokusai ○ captures frequencies of items in a time series ● Design permits TopK queries over arbitrary time intervals (Top100 popular URLs) SELECT pageURL, count(*) frequency FROM Table WHERE …. GROUP BY …. ORDER BY frequency DESC LIMIT 100
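For readers unfamiliar with CMS, a toy Count-Min Sketch in Scala is shown below; the real structures behind Hokusai add time-decayed sketch matrices per time interval. This is an illustration, not the SnappyData code.

```scala
// Toy Count-Min Sketch: depth hash rows x width counters; estimates never under-count.
class CountMinSketch(depth: Int, width: Int, seed: Long = 42L) {
  private val table  = Array.ofDim[Long](depth, width)
  private val rnd    = new scala.util.Random(seed)
  private val hashes = Array.fill(depth)((rnd.nextInt() | 1, rnd.nextInt())) // odd multiplier

  private def bucket(row: Int, item: String): Int = {
    val (a, b) = hashes(row)
    java.lang.Math.floorMod(a * item.hashCode + b, width)
  }

  def add(item: String, count: Long = 1L): Unit =
    (0 until depth).foreach(r => table(r)(bucket(r, item)) += count)

  // Point query: take the minimum across rows to bound collision noise.
  def estimate(item: String): Long =
    (0 until depth).map(r => table(r)(bucket(r, item))).min
}

// Usage: approximate URL frequencies.
// val cms = new CountMinSketch(depth = 5, width = 1 << 16)
// cms.add("/home"); cms.add("/home"); cms.estimate("/home")  // ~ 2
```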
  • 46. Demo Zeppelin Spark Interpreter (Driver) Zeppelin Server Row cache Columnar compressed Spark Executor JVM Row cache Columnar compressed Spark Executor JVM Row cache Columnar compressed Spark Executor JVM
  • 47. A new approach to Real-Time Analytics Streaming Analytics Probabilistic data Distributed In-Memory SQL Deep integration of Spark + Gem Unified cluster, AlwaysOn, cloud ready for real-time analytics Vision: drastically reduce the cost and complexity in modern big data … using a fraction of the resources: 10X better response time, drop resource cost 10X, reduce complexity 10X. Integrates with deep-scale, high-volume MPP DBs.

Editor's Notes

  1. Rather than the master-worker pattern in Spark, we internally make all the Spark executors aware of each other. In fact, we start a full-fledged P2P consensus-based distributed system. Essentially, a coordinator is elected from among the members, and every member joining or leaving notifies the coordinator, who then makes sure that all members have the same view of the system membership. We ensure that core properties like view consistency and virtual synchrony hold even when the system is exposed to failures and joins/leaves.
  2. By default, we start the Spark cluster in an “embedded” mode, i.e. the in-memory store is fully colocated and in the same process space. We had to change the Spark Block Manager so that both Gem and Spark share the same space for tables, cached RDDs, shuffle space, sorting, etc. This space can extend from the JVM heap to off-heap. GemFire proactively monitors the JVM “old gen” so it never goes beyond a critical threshold; i.e. we do a number of things so you don’t run OOM. We hope to contribute this back to Spark. And, when running in embedded mode, we also make sure the executors are long lived, i.e. their life cycle is no longer tied to Driver availability. Everything Spark does is still cleaned up as expected, though.
  3. The partitioning strategy, by default, is the same as Spark’s: we try to distribute records uniformly at random across all the nodes designated to host a partitioned table. Any table can have one or more replicas. Replicas are always consistent with each other through synchronous writes: we send the write to each replica in parallel and wait for ACKs, and if an ACK is not received we start SUSPECT processing. Replicated tables, by default, are replicated to each node. Replicas are guaranteed to be consistent when failures occur, i.e. when the failed node rejoins. How to recreate the state of a replica while thousands of other concurrent writes are in progress is a hard problem to solve.
  4. And, of course, the whole point behind colocation is to scale linearly with minimal or even no shuffling. So, for instance, when using Kafka, all three components, the Kafka queue, the native RDD in Spark and the table in Snappy, can share the same partitioning strategy. As an example, in our telco case, all records associated with a subscriber can be colocated onto the same node: the queue, the Spark processing of partitions and the related reference data in the Snappy store.
  5. In the current release, column tables have the same semantics as Spark: there can be no constraints, and no PK either. But all row tables are SQL compliant: PKs, FKs, constraints. Transaction support covers both repeatable_read and read_committed. To achieve high throughput, writes typically go through stages, especially when inserting into column tables: a record initially lands in a “delta row buffer” that is periodically emptied or aged into the column store. As mentioned before, columns are stored in arrays/ByteBuffers with compression.
  6. There is a reciprocal relationship with Spark RDDs/DataFrames: any table is visible as a DataFrame and vice versa. Hence, all the Spark APIs and transformations can also be applied to Snappy-managed tables. For instance, you can use the DataFrame data source API to save any arbitrary DataFrame into a Snappy table, as shown in the example. One cool aspect of Spark is its ability to take an RDD of objects (say with nested structure) and implicitly infer its schema, i.e. turn it into a DataFrame and store it.
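A small sketch of the schema-inference flow this note describes, using plain Spark 1.x APIs; the Subscriber case class and the sc, sqlContext and props names are assumptions, and the save call reuses the saveAsTable pattern from slide 29.

```scala
import org.apache.spark.sql.SaveMode

case class Subscriber(id: Long, plan: String, minutes: Int)

val rdd = sc.parallelize(Seq(Subscriber(1, "4G", 120), Subscriber(2, "3G", 45)))

import sqlContext.implicits._
val df = rdd.toDF()          // schema is inferred from the case class fields

// Store the inferred DataFrame into a Snappy row table (table name is hypothetical).
df.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("SUBSCRIBERS")
```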
  7. The SQL dialect will be Spark SQL ++, i.e. we are extending SQL to be much more compliant with standard SQL. A number of the extensions that dictate things like HA, disk persistence, etc. are specified through OPTIONS in Spark SQL.
  8. When it comes to interactive analytics, a lot is exploratory in nature. Folks are looking at trends for different time periods, studying outlier patterns, etc. Unfortunately, as pointed out before, analytic queries can take a long time even when in-memory. We want such exploratory analytics to ultimately happen at Google-like speeds: don’t break the speed of thought. In many cases, do we really need a precise answer, like when watching a trend on a visualization tool? We are throwing linear improvements at what seems like an exponential problem, as in some IoT scenarios. Stratified sampling allows the user to sample more intelligently, so we can answer queries with a very small fraction of the data at good accuracy. What we do is allow the user to create one or more stratified samples on some “base” table data. The base table may itself be all in-memory or, more often than not, could reside in HDFS.