Explore big data at speed of thought
with Spark 2.0 and SnappyData
www.snappydata.io
Jags Ramnarayan
CTO, Co-founder @ SnappyData
Our Pedigree
SnappyData SpinOut
● New Spark-based open source
project started by Pivotal
GemFire founders+engineers
● Decades of in-memory data
management experience
● Focus on real-time, operational
analytics: Spark inside an
OLTP+OLAP database
Funded by Pivotal, GE, GTD Capital
Our Mission
[Figure: Spark jobs run on Spark executors; the Spark cluster is for COMPUTE. Data arrives in disparate formats (JSON, CSV, Parquet, ...) from S3, HDFS, files and a DB tier (NoSQL, SQL, ...). STATE held in Spark is ephemeral and read-only.]
Spark is a Compute engine that works with disparate databases
Our Mission – Spark cluster is also an Operational DB
[Figure: Spark jobs run on Spark executors over S3, HDFS, files…; the Spark cluster is for COMPUTE and holds only a read-only Spark cache.]
Deep fusion of Spark with hybrid in-memory database – OLTP, OLAP
SnappyData
- Support mutability, transactions
- Point lookups, updates
- higher performance, less complex
- SQL compliant (Not just selects)
- HA (Replication across geos)
- Persistence: backup, recovery
- Far fewer resources (Synopses)
Focus for this talk
• Operational analytics: interactive analytic query processing
• Improvements in Spark SQL performance
• Why is in-memory analytics still challenging?
• The SnappyData solution – brief overview (will not dive into Hybrid DB)
• Synopses Data Engine – focus on Stratified sampling
• Demo using Zeppelin
• Q&A
DataFrame(DF) and Query plan in Spark
• Distributed data organized as named columns
- Similar to R/Python DataFrame
• But, with richer transformations, optimizations
• Can be created from many disparate sources
• Any SQL in Spark when compiled is expressed
as transformations on DFs
[Figure: query plan over Data built from Scan, Project, Aggregate, Join and Filter operators, each producing a DataFrame]
select AVG(ArrDelay) arrivalDelay,
UniqueCarrier carrier from airline JOIN history
where <filter> group by UniqueCarrier
Is this fast enough?
- Spark 1.6, MacBook Pro, 4 cores, 2.8 GHz Intel i7, enough RAM
- Airline OnTime performance data set, 105 million records

Query | Parquet files in OS buffer | Managed in Spark memory
select AVG(ArrDelay) from airline | ~ 3 seconds | ~ 2 seconds
select AVG(ArrDelay) arrivalDelay, UniqueCarrier carrier from airline group by UniqueCarrier order by UniqueCarrier | ~ 10 seconds | ~ 6 seconds
Spark 1.6 query plan
What is expensive?
Query: select AVG(ArrDelay) from airline
- Scan over 105 million integers
- Shuffle results from each partition so we can compute Avg across all partitions
  - is cheap in this case … only 11 partitions
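Why the shuffle is cheap here can be sketched in a few lines: each partition only has to ship a (sum, count) pair, so 11 partitions mean 11 tiny records. This is a minimal plain-Python sketch, not Spark's implementation; the function names and the sample values are hypothetical.

```python
from typing import Iterable, List, Tuple


def partial_avg(partition: Iterable[int]) -> Tuple[int, int]:
    """Scan one partition and return its (sum, count) partial aggregate."""
    total, count = 0, 0
    for value in partition:
        total += value
        count += 1
    return total, count


def merge_avg(partials: List[Tuple[int, int]]) -> float:
    """Combine the per-partition partials (the 'shuffled' data) into one average."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else float("nan")


# Hypothetical ArrDelay values split across 3 partitions.
partitions = [[10, 20, 30], [5, 15], [40]]
partials = [partial_avg(p) for p in partitions]  # one small tuple per partition
average = merge_avg(partials)
```

The expensive part is the scan inside `partial_avg`, which touches every value; the merge step only sees one tuple per partition.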
How did Spark 2.0 do?
- Spark 2.0, MacBook Pro, 4 cores, 2.8 GHz Intel i7, enough RAM
- Airline OnTime performance data set, 105 million records

Query | Parquet files in OS buffer | Managed in Spark memory
select AVG(ArrDelay) from airline | ~ 3 seconds | ~ 600 milliseconds

More than 3X faster than Spark 1.6
Spark 2.0 query plan
What is different?
Query: select AVG(ArrDelay) from airline
- Scan over 105 million integers is much faster now
- Shuffle results from each partition so we can compute Avg across all partitions
  - is cheap in this case … only 11 partitions
Whole Stage Code Generation
- Each Operator implemented using
functions
- And, functions imply chasing
pointers … Expensive
- Code Generation
-- Remove virtual function calls
-- Array, variables instead of objects
-- Capitalize on modern CPU cache
[Figure: operator tree with Aggregate, Filter, Scan and Project]
How to remove complexity? Add a layer
How to improve perf? Remove a layer
Filter() {
  getNextRow() {
    row = get a row from scan()   // child operator
    apply filter condition to row
    if true: return row           // otherwise pull the next row
  }
}
Scan() {
  getNextRow() {
    get row from fileInputStream
  }
}
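The contrast between the operator-at-a-time model and generated code can be illustrated with a small sketch. It is plain Python rather than Spark's generated Java, and the class and function names are assumptions made for illustration: the first version calls `get_next_row` through a chain of operator objects for every row, while the second is the kind of single tight loop that code generation aims to produce for the same query.

```python
from typing import Callable, List, Optional


class Scan:
    """Leaf operator: hands out one row at a time."""

    def __init__(self, rows: List[int]) -> None:
        self._it = iter(rows)

    def get_next_row(self) -> Optional[int]:
        return next(self._it, None)  # None signals end of input


class Filter:
    """Pulls rows from its child until one passes the predicate."""

    def __init__(self, child: Scan, predicate: Callable[[int], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def get_next_row(self) -> Optional[int]:
        while True:
            row = self.child.get_next_row()
            if row is None or self.predicate(row):
                return row


def sum_with_operators(rows: List[int], predicate: Callable[[int], bool]) -> int:
    """Iterator model: one chain of method calls per row."""
    op = Filter(Scan(rows), predicate)
    total = 0
    while (row := op.get_next_row()) is not None:
        total += row
    return total


def sum_fused(rows: List[int]) -> int:
    """Fused form of 'sum of rows where row > 10': one loop, no operator objects."""
    total = 0
    for row in rows:
        if row > 10:
            total += row
    return total
```

Both functions return the same answer; the fused version simply removes the per-row indirection that the slide calls "chasing pointers".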
Why columnar storage in-memory?
Source: MonetDB
Good enough? Hitting the CPU Wall?
select
count(*) , airlineName
From history t1, current t2, airports t3
Where t1 Join t2 Join t3
group by description
order by count desc limit 8
Distributed Joins can be very expensive
[Figure: response time in seconds (axis 0 to 200) versus concurrency (1 and 10)]
Moving, Copying costs
• Aggregations – GroupBy, MapReduce
• Joins with other streams, Reference data
Shuffle costs (copying, serialization)
Excessive copying in Java-based scale-out stores
- DRAM is still relatively expensive for the deluge of data
- Analytics in the cloud requires fluid data movement
-- How do you move large volumes to/from clouds?
Challenges with In-memory Analytics
• Most apps are happy to trade off 1% accuracy
for a 200x speedup!
• Can usually get a 99.9% accurate answer by only
looking at a tiny fraction of data!
• Often can make perfectly accurate
decisions without having perfectly
accurate answers!
• A/B Testing, visualization, ...
• The data itself is usually noisy
• Processing entire data doesn’t necessarily mean
exact answers!
• Inference is probabilistic anyway
Use statistical techniques to shrink data?
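The idea of answering from a small fraction of the data can be sketched as follows. This is a minimal illustration on synthetic values, not the airline data set and not SnappyData's engine; `estimate_mean` and its parameters are hypothetical names. It draws a 1% uniform sample and reports the estimate together with its standard error.

```python
import math
import random
import statistics


def estimate_mean(population, fraction, seed=42):
    """Estimate the mean from a uniform random sample; return (estimate, standard error)."""
    rng = random.Random(seed)
    n = max(2, int(len(population) * fraction))
    sample = rng.sample(population, n)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return estimate, std_err


# Synthetic "delay" values, generated only for this illustration.
gen = random.Random(7)
population = [gen.gauss(10.0, 30.0) for _ in range(200_000)]

exact = statistics.fmean(population)                           # full scan
estimate, std_err = estimate_mean(population, fraction=0.01)   # 1% of the rows
```

The standard error shrinks with the square root of the sample size, which is why a small sample is often enough when the data is noisy anyway.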
SnappyData
A Hybrid Open source system for Transactions, Analytics,
Streaming
(https://github.com/SnappyDataInc/snappydata)
SnappyData – In-memory Hybrid DB with Spark
A Single Unified Cluster: OLTP + OLAP + Streaming
for real-time analytics
Batch design, high throughput
Real-time design: low latency, HA, concurrency
Vision: Drastically reduce the cost and
complexity in modern big data
Rapidly maturing
Matured over 13 years
Maintain recent data in-memory, lazily fetch from source
[Figure: Snappy Data Server (Spark Executor + Store) with in-memory compute and state: current operational data, history data and synopses data. Inputs: streams from Kafka (process, store streams), batch compute, and reference data from RDB and HDFS (lazy write, fetch on demand). External data: S3, RDB, MPP DB… Access: Spark API ++ (Java, Scala, Python, R, REST) and interactive analytic queries.]
Realizing ‘speed-of-thought’ Analytics
[Figure: Snappy Data Server (Spark Executor + Store) with a hybrid store (rows, columnar, index, synopses) for in-memory compute and state, with overflow and local persistence. Inputs: stream processing from a Kafka queue (partition), batch compute, and RDB (reference data), HDFS and MPP DB. Access: Spark or SQL programs through Spark API ++ (Java, Scala, Python, R, REST) and interactive analytic queries (SQL, JDBC, ODBC).]
• Fast
- Stream, ingested data colocated on shared key
- Tables colocated on shared key
- Far less copying, serialization
- Improvements to vectorization (20X faster than Spark)
• Use less memory, CPU
- Maintain only “Hot/active” data in RAM
- Summarize all data using Synopses
• Flexible
- Spark. Enough said.
Fast, Fewer resources, Flexible
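Colocation on a shared key is what removes most of the copying: if two tables are partitioned the same way on the join key, every partition can be joined locally and no rows move between nodes. The following is a minimal plain-Python sketch of that idea under stated assumptions (the helper names, the CRC32-based partitioner and the sample rows are all hypothetical, not SnappyData's implementation).

```python
import zlib
from collections import defaultdict


def partition_by_key(rows, key, num_partitions):
    """Hash-partition rows on `key`; equal keys always land in the same partition."""
    parts = defaultdict(list)
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode()) % num_partitions
        parts[bucket].append(row)
    return parts


def local_join(left_part, right_part, key):
    """Hash join of two partitions that live on the same node."""
    index = defaultdict(list)
    for r in right_part:
        index[r[key]].append(r)
    return [{**l, **r} for l in left_part for r in index.get(l[key], [])]


def colocated_join(left, right, key, num_partitions=4):
    """Join partition i of `left` only with partition i of `right`: no shuffle needed."""
    lparts = partition_by_key(left, key, num_partitions)
    rparts = partition_by_key(right, key, num_partitions)
    out = []
    for p in range(num_partitions):
        out.extend(local_join(lparts.get(p, []), rparts.get(p, []), key))
    return out


impressions = [
    {"adv": "adv10", "bid": 0.0001},
    {"adv": "adv20", "bid": 0.0002},
    {"adv": "adv10", "bid": 0.0003},
]
advertisers = [
    {"adv": "adv10", "name": "A"},
    {"adv": "adv20", "name": "B"},
    {"adv": "adv30", "name": "C"},
]
joined = colocated_join(impressions, advertisers, key="adv")
```

Because both sides use the same partitioning function, matching keys are guaranteed to meet in the same partition, so the per-partition joins together give the full join result.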
Features
- Deeply integrated database for Spark
- 100% compatible with Spark
- Extensions for Transactions (updates), SQL stream processing
- Extensions for High Availability
- Approximate query processing for interactive OLAP
- OLTP+OLAP Store
- Replicated and partitioned tables
- Tables can be Row or Column oriented (in-memory & on-disk)
- SQL extensions for compatibility with SQL Standard
- create table, view, indexes, constraints, etc
TPC-H: 10X-20X faster than Spark 2.0
Synopses Data Engine
Uniform (Random) Sampling

Original Table
ID | Advertiser | Geo | Bid
1 | adv10 | NY | 0.0001
2 | adv10 | VT | 0.0005
3 | adv20 | NY | 0.0002
4 | adv10 | NY | 0.0003
5 | adv20 | NY | 0.0001
6 | adv30 | VT | 0.0001

Uniform Sample
ID | Advertiser | Geo | Bid | Sampling Rate
3 | adv20 | NY | 0.0002 | 1/3
5 | adv20 | NY | 0.0001 | 1/3

SELECT avg(bid)
FROM AdImpressions
WHERE geo = 'VT'

The uniform sample contains no VT rows, so it cannot answer this query.
Uniform (Random) Sampling

Original Table
ID | Advertiser | Geo | Bid
1 | adv10 | NY | 0.0001
2 | adv10 | VT | 0.0005
3 | adv20 | NY | 0.0002
4 | adv10 | NY | 0.0003
5 | adv20 | NY | 0.0001
6 | adv30 | VT | 0.0001

Larger Uniform Sample
ID | Advertiser | Geo | Bid | Sampling Rate
3 | adv20 | NY | 0.0002 | 2/3
5 | adv20 | NY | 0.0001 | 2/3
1 | adv10 | NY | 0.0001 | 2/3
2 | adv10 | VT | 0.0005 | 2/3

SELECT avg(bid)
FROM AdImpressions
WHERE geo = 'VT'
Stratified Sampling

Original Table
ID | Advertiser | Geo | Bid
1 | adv10 | NY | 0.0001
2 | adv10 | VT | 0.0005
3 | adv20 | NY | 0.0002
4 | adv10 | NY | 0.0003
5 | adv20 | NY | 0.0001
6 | adv30 | VT | 0.0001

Stratified Sample on Geo
ID | Advertiser | Geo | Bid | Sampling Rate
3 | adv20 | NY | 0.0002 | 1/4
2 | adv10 | VT | 0.0005 | 1/2

SELECT avg(bid)
FROM AdImpressions
WHERE geo = 'VT'
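The tables above can be turned into a small sketch showing why stratifying on Geo helps. It is a minimal illustration, not SnappyData's sampler: the function names are hypothetical, one row is kept per stratum to mirror the slide, and each sampled row is weighted by 1/rate when answering the query.

```python
import random
from collections import defaultdict

ROWS = [  # (id, advertiser, geo, bid): the original table from the slide
    (1, "adv10", "NY", 0.0001),
    (2, "adv10", "VT", 0.0005),
    (3, "adv20", "NY", 0.0002),
    (4, "adv10", "NY", 0.0003),
    (5, "adv20", "NY", 0.0001),
    (6, "adv30", "VT", 0.0001),
]


def stratified_sample(rows, stratum_of, per_stratum, seed=0):
    """Keep up to `per_stratum` rows per stratum; attach each row's sampling rate."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for members in strata.values():
        k = min(per_stratum, len(members))
        for row in rng.sample(members, k):
            sample.append((row, k / len(members)))
    return sample


def weighted_avg_bid(sample, geo):
    """avg(bid) WHERE geo = ...; each sampled row stands for 1/rate original rows."""
    num = den = 0.0
    for (_, _, row_geo, bid), rate in sample:
        if row_geo == geo:
            num += bid / rate
            den += 1 / rate
    return num / den if den else None


sample = stratified_sample(ROWS, stratum_of=lambda r: r[2], per_stratum=1)
vt_estimate = weighted_avg_bid(sample, "VT")

# Rows 3 and 5 at rate 1/3, as in the small uniform sample: no VT row survives.
uniform_sample = [(ROWS[2], 1 / 3), (ROWS[4], 1 / 3)]
vt_from_uniform = weighted_avg_bid(uniform_sample, "VT")  # None: cannot answer
```

With one row per stratum the sampling rates come out as 1/4 for NY and 1/2 for VT, matching the slide, and the rare VT group is always represented.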
Value of Sampling grows with volume
Select avg(Bid), Advertiser from T1 group by Advertiser
Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1
Speed/Accuracy tradeoff
[Figure: error (%) versus execution time. Executing on the entire dataset takes 30 mins; interactive queries take 2 sec. Marked points: 100 secs, 2 secs, 1% error.]
Query execution with accuracy guarantee
1. PARSE QUERY
2. Can the query be executed on samples?
   Yes:
   - Recent time window
   - Computable from samples
   - Within error constraints
   => In-memory execution with error bar, then response
   No:
   - Point query on history
   - Outlier query
   - Very complex query
   => Execute in parallel on the base table, then response
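The decision in this flow can be sketched as a small routing function. Everything here is hypothetical: the `QueryInfo` fields are invented for illustration, and the sketch assumes all three "Yes" conditions must hold together, which the slide does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class QueryInfo:
    """Hypothetical summary of a parsed query; field names are illustrative only."""
    in_recent_window: bool
    computable_from_samples: bool
    estimated_error: float
    max_error: float
    is_point_lookup: bool = False
    needs_outliers: bool = False


def choose_plan(q: QueryInfo) -> str:
    """Return 'samples' when the sample-friendly conditions hold, else 'base_table'."""
    if q.is_point_lookup or q.needs_outliers:
        return "base_table"
    if (q.in_recent_window
            and q.computable_from_samples
            and q.estimated_error <= q.max_error):
        return "samples"
    return "base_table"
```

For example, an aggregate over a recent window with an estimated error of 0.05 against a requested bound of 0.1 would be routed to the samples, while a point lookup on history always goes to the base table.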
Synopses Data Engine Features
• Support for uniform sampling
• Support for stratified sampling
- Solutions exist for stored data (BlinkDB)
- SnappyData works for infinite streams of data too
• Support for exponentially decaying windows over time
• Support for synopses
- Top-K queries, heavy hitters, outliers, ...
• [future] Support for joins
• Workload mining (http://CliffGuard.org)
Sketching techniques
● Sampling not effective for outlier detection
○ MAX/MIN etc
● Other probabilistic structures like CMS, heavy hitters, etc
● SnappyData implements Hokusai
○ Capturing item frequencies in timeseries
● Design permits TopK queries over arbitrary time intervals
(Top100 popular URLs)
SELECT pageURL, count(*) frequency FROM Table
WHERE …. GROUP BY ….
ORDER BY frequency DESC
LIMIT 100
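A Count-Min Sketch (CMS), the building block mentioned above, can be sketched in a few lines. This is a plain CMS, not the time-aggregated Hokusai structure, and it is not SnappyData's code; the class name, the default width and depth, and the sample URLs are assumptions for illustration. A CMS never underestimates a count, which makes it usable for approximate Top-K over a set of candidate keys.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Fixed-size frequency summary: depth hash rows, width counters per row."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One independent hash per row, derived by salting BLAKE2b with the row id.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._slots(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._slots(item))


def top_k(sketch, candidates, k):
    """Rank candidate keys by estimated frequency, highest first."""
    return sorted(candidates, key=lambda c: (-sketch.estimate(c), c))[:k]


urls = ["/home"] * 50 + ["/cart"] * 20 + ["/help"] * 5
sketch = CountMinSketch()
for url in urls:
    sketch.add(url)
exact = Counter(urls)
```

Memory stays fixed at width x depth counters regardless of how many distinct URLs arrive; keeping one such sketch per time bucket is one way to support queries over time intervals.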
Synopses Data Engine Demo
[Figure: Zeppelin server and Zeppelin Spark interpreter (driver) connected to three Spark executor JVMs, each holding a row cache and compressed columnar data]
Free Cloud trial service – Project iSight
● Free AWS/Azure credits for folks to try out SnappyData
● One click launch of private SnappyData cluster with Zeppelin
● Multiple notebooks with comprehensive description of concepts and
value
● Bring your own data sets to try ‘Instant visualization’ using Synopses
data
Send email to chomp@snappydata.io to be notified. Anticipate release in next 2 weeks
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: TB problem becomes GB.
○ CPU contention drops
● Far less complex
○ single cluster for stream ingestion, continuous queries, interactive
queries and machine learning
● Much faster
○ compressed data managed in distributed memory in columnar
form reduces volume and is much more responsive
www.snappydata.io
SnappyData is Open Source
● Ad Analytics example/benchmark -
https://github.com/SnappyDataInc/snappy-poc
● https://github.com/SnappyDataInc/snappydata
● Learn more www.snappydata.io/blog
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com
EXTRAS
Use Case Patterns
1. Operational Analytics DB
- Caching for Analytics over disparate sources
- Federate query between samples and backend
2. Stream analytics for Spark
Process streams, transform, real-time scoring, store, query
3. In-memory transactional store
Highly concurrent apps, SQL cache, OLTP + OLAP
How SnappyData Extends Spark
Snappy Spark Cluster Deployment topologies
Unified Cluster
• Snappy store and Spark Executor share the JVM memory
• Reference based access – zero copy
Split Cluster
• SnappyStore is isolated but uses the same column format as Spark for high throughput
Simple API – Spark Compatible
● Access table as DataFrame
(catalog is automatically recovered)
● RDD[T]/DataFrame can be
stored in SnappyData tables
● Access from remote SQL clients
● Additional API for updates,
inserts, deletes
import org.apache.spark.sql.{DataFrame, SaveMode}
// context, props, myDataFrame, dataDF and the table names are assumed to be defined elsewhere
// Save a DataFrame using the Snappy or Spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)
// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")
// Access tables as DataFrames
val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of DataFrame APIs … >
Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
  <column definition>
) USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name', // Default none
  PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default
  REDUNDANCY '1', // Manage HA
  PERSISTENT 'DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS',
  // Empty string will map to default disk store.
  OFFHEAP 'true | false',
  EVICTION_BY 'MEMSIZE 200 | COUNT 200 | HEAPPERCENT',
  …..
)
[AS select_statement];
Simple to Ingest Streams using SQL
Consume from stream
Transform raw data
Continuous Analytics
Ingest into in-memory Store
Overflow table to HDFS
Create stream table AdImpressionLog
(<Columns>) using directkafka_stream options (
<socket endpoints>
topics 'adnetwork-topic',
rowConverter 'AdImpressionLogAvroDecoder')
streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps, " +
  "count(distinct(cookie)) uniques from AdImpressionLog " +
  "window (duration '2' seconds, slide '2' seconds) " +
  "where geo != 'unknown' group by publisher, geo") // Register CQ
  .foreachDataFrame(df => {
    // Ingest each window's result into the in-memory column store
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })
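What the continuous query computes per window can be sketched outside Spark. The following plain-Python function is an illustration only (the function name, the tuple layout and the sample events are assumptions): it groups events into 2-second tumbling windows, drops rows with an unknown geo, and produces avg_bid, imps and uniques per (window, publisher, geo).

```python
from collections import defaultdict


def tumbling_window_agg(events, window_seconds=2):
    """events: iterable of (timestamp_seconds, publisher, geo, bid, cookie)."""
    groups = defaultdict(lambda: {"bids": [], "cookies": set()})
    for ts, publisher, geo, bid, cookie in events:
        if geo == "unknown":  # where geo != 'unknown'
            continue
        window_start = int(ts // window_seconds) * window_seconds
        group = groups[(window_start, publisher, geo)]
        group["bids"].append(bid)
        group["cookies"].add(cookie)
    return {
        key: {
            "avg_bid": sum(g["bids"]) / len(g["bids"]),
            "imps": len(g["bids"]),        # count(*)
            "uniques": len(g["cookies"]),  # count(distinct(cookie))
        }
        for key, g in groups.items()
    }


events = [
    (0.5, "pub1", "NY", 0.2, "c1"),
    (1.5, "pub1", "NY", 0.4, "c1"),
    (1.9, "pub1", "unknown", 0.9, "c2"),
    (2.1, "pub1", "NY", 0.6, "c3"),
]
result = tumbling_window_agg(events)
```

Because duration and slide are both 2 seconds in the query, windows do not overlap, so each event belongs to exactly one window.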
Unified Cluster Architecture
How do we extend Spark for Real Time?
• Spark Executors are long
running. Driver failure
doesn’t shut down
Executors
• Driver HA – Drivers run
“Managed” with standby
secondary
• Data HA – Consensus based
clustering integrated for
eager replication
How do we extend Spark for Real Time?
• Bypass scheduler for low
latency SQL
• Deep integration with
Spark Catalyst(SQL) –
collocation optimizations,
indexing use, etc
• Full SQL support –
Persistent Catalog,
Transaction, DML
AdImpression Demo
Spark, SQL Code Walkthrough, interactive SQL
Concurrent Ingest + Query Performance
• 4 AWS c4.2xlarge instances
- 8 cores, 15GB mem
• Each node ingests the stream from Kafka in parallel
• Parallel batch writes to store (32 partitions)
• Only a few cores are used for stream writes, as most of the CPU is reserved for OLAP queries
Stream ingestion rate
(On 4 nodes with cap on CPU to allow for queries)
System | Throughput (per second)
Spark-Cassandra | 322000
Spark-InMemoryDB | 480000
SnappyData | 670000
https://github.com/SnappyDataInc/snappy-poc
2X – 45X faster (vs Cassandra, IMDB)
Concurrent Ingest + Query Performance
Q1: Sample “scan” oriented OLAP query (Spark SQL) performance executed
while ingesting data
select count(*) AS adCount, geo from adImpressions
group by geo order by adCount desc limit 20;

Response time (millis)
System | 30M | 60M | 90M
Spark-Cassandra | 20346 | 65061 | 93960
Spark-InMemoryDB | 3649 | 5801 | 7295
SnappyData | 1056 | 1571 | 2144
https://github.com/SnappyDataInc/snappy-poc
2X – 45X faster
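The speedup range can be checked with a little arithmetic on the reported numbers. The assignment of the chart values to systems and to the 30M/60M/90M points is an assumption based on the order in which they appear in the slide text; the variable and function names are hypothetical.

```python
# Response times in milliseconds at 30M, 60M and 90M (assumed mapping, see above).
response_ms = {
    "Spark-Cassandra": [20346, 65061, 93960],
    "Spark-InMemoryDB": [3649, 5801, 7295],
    "SnappyData": [1056, 1571, 2144],
}
# Stream ingestion throughput per second.
ingest_per_sec = {
    "Spark-Cassandra": 322000,
    "Spark-InMemoryDB": 480000,
    "SnappyData": 670000,
}


def query_speedups(baseline):
    """How many times faster SnappyData answers Q1 than `baseline` at each data size."""
    return [
        round(b / s, 1)
        for b, s in zip(response_ms[baseline], response_ms["SnappyData"])
    ]


def ingest_speedup(baseline):
    """Ratio of SnappyData's ingestion throughput to the baseline's."""
    return round(ingest_per_sec["SnappyData"] / ingest_per_sec[baseline], 1)


vs_cassandra = query_speedups("Spark-Cassandra")
vs_inmemorydb = query_speedups("Spark-InMemoryDB")
```

Under this mapping the query speedups are roughly 19x to 44x against Spark-Cassandra and about 3.4x to 3.7x against Spark-InMemoryDB, and the ingestion advantage over Spark-Cassandra is about 2.1x, which is consistent with the quoted "2X – 45X" range.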

Big Data Day LA 2016/ NoSQL track - MongoDB 3.2 Goodness!!!, Mark Helmstetter...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Deep Learning at Scale - A...
 
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...
Big Data Day LA 2016/ Use Case Driven track - Data and Hollywood: "Je t'Aime ...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Panel - Interactive Applic...
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
 

Explore big data at speed of thought with Spark 2.0 and SnappyData

  • 1. Explore big data at speed of thought with Spark 2.0 and SnappyData. www.snappydata.io. Jags Ramnarayan, CTO, Co-founder @ SnappyData
  • 2. Our Pedigree: SnappyData SpinOut ● New Spark-based open source project started by Pivotal GemFire founders and engineers ● Decades of in-memory data management experience ● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database ● Funded by Pivotal, GE, GTD Capital
  • 3. Our Mission. [Diagram labels: Spark Executor; Spark Jobs; Spark Cluster is for COMPUTE; Disparate data formats (JSON, CSV, Parquet, ...); DB Tier (NoSQL, SQL, ...); S3, HDFS, Files...; Ephemeral, read-only STATE.] Spark is a compute engine that works with disparate databases.
  • 4. Our Mission: the Spark cluster is also an Operational DB. [Diagram labels: Spark Executor; Spark Jobs; Spark Cluster is for COMPUTE; S3, HDFS, Files...; Spark read-only cache.] Deep fusion of Spark with a hybrid in-memory database (OLTP, OLAP). SnappyData: - Supports mutability, transactions - Point lookups, updates - Higher performance, less complex - SQL compliant (not just selects) - HA (replication across geos) - Persistence: backup, recovery - Far fewer resources (Synopses)
  • 5. Focus for this talk: • Is Operational Analytics: interactive analytic query processing • Improvements in Spark SQL performance • Why is in-memory analytics still challenging? • The SnappyData solution: brief overview (will not dive into Hybrid DB) • Synopses Data Engine: focus on stratified sampling • Demo using Zeppelin • Q&A
  • 6. DataFrame (DF) and query plan in Spark: • Distributed data organized as named columns - Similar to an R/Python DataFrame • But with richer transformations and optimizations • Can be created from many disparate sources • Any SQL in Spark, when compiled, is expressed as transformations on DFs. [Diagram: Data → Scan → Filter → Join → Aggregate → Project, each operator producing a DataFrame.] Example query: select AVG(ArrDelay) arrivalDelay, UniqueCarrier carrier from airline JOIN history where <filter> group by UniqueCarrier
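The idea on slide 6, that a SQL query compiles into a chain of transformations, can be sketched in plain Python. This is only an illustration, not Spark's API: the operator functions (`scan`, `filter_rows`, `aggregate_avg`, `project`, `run_query`) and the sample rows are hypothetical, the join against `history` is left out, and a non-null check stands in for the slide's unspecified `<filter>`.

```python
from collections import defaultdict

# Hypothetical stand-ins for plan operators; each takes rows and yields rows.

def scan(rows):
    """Leaf operator: emit every row of the source."""
    yield from rows

def filter_rows(rows, predicate):
    """Keep only rows for which the predicate holds."""
    return (r for r in rows if predicate(r))

def aggregate_avg(rows, key, value):
    """Group by `key` and compute the average of `value` per group."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for r in rows:
        acc = sums[r[key]]
        acc[0] += r[value]
        acc[1] += 1
    return ({key: k, "avg": s / n} for k, (s, n) in sorted(sums.items()))

def project(rows, mapping):
    """Rename/select columns: mapping is {output_name: input_name}."""
    return [{new: r[old] for new, old in mapping.items()} for r in rows]

def run_query(airline):
    """Roughly: select AVG(ArrDelay) arrivalDelay, UniqueCarrier carrier
    from airline where ArrDelay is not null group by UniqueCarrier."""
    rows = scan(airline)
    rows = filter_rows(rows, lambda r: r["ArrDelay"] is not None)
    rows = aggregate_avg(rows, "UniqueCarrier", "ArrDelay")
    return project(rows, {"carrier": "UniqueCarrier", "arrivalDelay": "avg"})

# Made-up sample rows, for illustration only.
sample = [
    {"UniqueCarrier": "AA", "ArrDelay": 10},
    {"UniqueCarrier": "AA", "ArrDelay": 20},
    {"UniqueCarrier": "UA", "ArrDelay": 5},
    {"UniqueCarrier": "UA", "ArrDelay": None},
]
result = run_query(sample)
```

Each function consumes the output of the previous one, which mirrors how every operator in the slide's plan produces a new DataFrame.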
  • 7. Is this fast enough? - Spark 1.6, MacBook Pro 4 core, 2.8 Ghz Intel i7, enough RAM - Airline OnTime performance data set, 105 Million records select AVG(ArrDelay) from airline ~ 3 seconds ~ 2 seconds select AVG(ArrDelay) arrivalDelay, UniqueCarrier carrier from airline group by UniqueCarrier order by UniqueCarrier Parquet files in OS Buffer Managed in Spark memory ~ 10 seconds ~ 6 seconds
  • 8. Spark 1.6 query plan What is expensive? Scan over 105 million Integers select AVG(ArrDelay) from airline Shuffle results from each partition so we can compute Avg across all partitions - is cheap in this case … only 11 partitions
  • 9. How did Spark 2.0 do? - Spark 2.0, MacBook Pro 4 core, 2.8 GHz Intel i7, enough RAM - Airline OnTime performance data set, 105 million records. select AVG(ArrDelay) from airline: ~3 seconds (Parquet files in OS buffer), ~600 milliseconds (managed in Spark memory). More than 3X faster than Spark 1.6
  • 10. Spark 2.0 query plan What is different? Scan over 105 million Integers is much faster now Shuffle results from each partition so we can compute Avg across all partitions - is cheap in this case … only 11 partitions select AVG(ArrDelay) from airline
  • 11. Whole Stage Code Generation - Each operator is implemented using functions - And functions imply chasing pointers … expensive - Code generation -- Remove virtual function calls -- Arrays and variables instead of objects -- Capitalize on modern CPU cache. How to remove complexity? Add a layer. How to improve perf? Remove a layer. [Diagram: operators Aggregate, Filter, Scan, Project.] Pseudocode: Filter() { getNextRow { get a row from scan() /* child */; apply filter condition; if true: return row } } Scan() { getNextRow { get row from fileInputStream } }
  • 12. Why columnar storage in-memory? Source: MonetDB
  • 13. Good enough? Hitting the CPU Wall? select count(*), airlineName From history t1, current t2, airports t3 Where t1 Join t2 Join t3 group by description order by count desc limit 8. Distributed Joins can be very expensive. [Chart: response time in seconds (0 to 200) versus concurrency (1 and 10).]
  • 14. Moving, Copying costs • Aggregations – GroupBy, MapReduce • Joins with other streams, Reference data Shuffle Costs (Copying, Serialization) Excessive copying in Java based Scale out stores
  • 15. Challenges with In-memory Analytics - DRAM is still relatively expensive for the deluge of data - Analytics in the cloud requires fluid data movement -- How do you move large volumes to/from clouds?
  • 16. Use statistical techniques to shrink data? • Most apps are happy to trade off 1% accuracy for a 200x speedup! • Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data! • Often can make perfectly accurate decisions without having perfectly accurate answers! • A/B Testing, visualization, ... • The data itself is usually noisy • Processing entire data doesn't necessarily mean exact answers! • Inference is probabilistic anyway
  • 17. SnappyData A Hybrid Open source system for Transactions, Analytics, Streaming (https://github.com/SnappyDataInc/snappydata)
  • 18. SnappyData – In-memory Hybrid DB with Spark A Single Unified Cluster: OLTP + OLAP + Streaming for real-time analytics Batch design, high throughput Real-time design Low latency, HA, concurrency Vision: Drastically reduce the cost and complexity in modern big data Rapidly Maturing Matured over 13 years
  • 19. Maintain recent data in-memory, lazily fetch from source. [Diagram: a Snappy Data Server (Spark Executor + Store) provides in-memory compute and state over current operational data, reference data, history data and synopses data. It processes and stores streams from Kafka, runs batch compute and serves interactive analytic queries, with lazy write / fetch on demand against RDB and HDFS and external data in S3, RDB, MPP DB, … Access is through Spark API ++: Java, Scala, Python, R, REST.]
  • 20. Realizing 'speed-of-thought' Analytics. [Diagram: a Snappy Data Server (Spark Executor + Store) with a hybrid store (rows, columnar, index, synopses) for in-memory compute and state, with overflow and local persistence. Inputs: stream processing from a Kafka queue (partition) and a Spark or SQL program for batch compute. External sources: RDB (reference data), HDFS, MPP DB. Access through Spark API ++ (Java, Scala, Python, R, REST) and interactive analytic queries (SQL, JDBC, ODBC).]
  • 21. Fast, Fewer resources, Flexible • Fast - Stream, ingested data colocated on shared key - Tables colocated on shared key - Far less copying, serialization - Improvements to vectorization (20X faster than Spark) • Use less memory, CPU - Maintain only "hot/active" data in RAM - Summarize all data using Synopses • Flexible - Spark. Enough said.
  • 22. Features - Deeply integrated database for Spark - 100% compatible with Spark - Extensions for Transactions (updates), SQL stream processing - Extensions for High Availability - Approximate query processing for interactive OLAP - OLTP+OLAP Store - Replicated and partitioned tables - Tables can be Row or Column oriented (in-memory & on-disk) - SQL extensions for compatibility with SQL Standard - create table, view, indexes, constraints, etc
  • 23. TPC-H: 10X-20X faster than Spark 2.0
  • 25. Uniform (Random) Sampling. Original table (ID, Advertiser, Geo, Bid): (1, adv10, NY, 0.0001); (2, adv10, VT, 0.0005); (3, adv20, NY, 0.0002); (4, adv10, NY, 0.0003); (5, adv20, NY, 0.0001); (6, adv30, VT, 0.0001). Uniform sample (ID, Advertiser, Geo, Bid, Sampling Rate): (3, adv20, NY, 0.0002, 1/3); (5, adv20, NY, 0.0001, 1/3). Query: SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'. The sample holds no VT rows, so the query cannot be answered from it.
  • 26. Uniform (Random) Sampling, larger sample. Original table as on slide 25. Uniform sample (ID, Advertiser, Geo, Bid, Sampling Rate): (3, adv20, NY, 0.0002, 2/3); (5, adv20, NY, 0.0001, 2/3); (1, adv10, NY, 0.0001, 2/3); (2, adv10, VT, 0.0005, 2/3). Query: SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'. Only after doubling the sample does a single VT row appear.
  • 27. Stratified Sampling. Original table as on slide 25. Stratified sample on Geo (ID, Advertiser, Geo, Bid, Sampling Rate): (3, adv20, NY, 0.0002, 1/4); (2, adv10, VT, 0.0005, 1/2). Query: SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'. Each Geo value is represented even though the sample has only two rows.
  • 28. Value of Sampling grows with volume. Exact query: Select avg(Bid), Advertiser from T1 group by Advertiser. Approximate query: Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1. [Chart: speed/accuracy tradeoff, Error (%) versus execution time; labels: 30 mins to execute on the entire dataset, 100 secs, and 2 secs for interactive queries at 1% error.]
  • 29. Query execution with accuracy guarantee. Parse the query, then ask: can the query be executed on samples? Yes (recent time window, computable from samples, within error constraints): in-memory execution with error bar, then response. No (point query on history, outlier query, very complex query): execute in parallel on the base table, then response.
  • 30. Synopses Data Engine Features • Support for uniform sampling • Support for stratified sampling - Solutions exist for stored data (BlinkDB) - SnappyData works for infinite streams of data too • Support for exponentially decaying windows over time • Support for synopses - Top-K queries, heavy hitters, outliers, ... • [future] Support for joins • Workload mining (http://CliffGuard.org)
  • 31. Sketching techniques ● Sampling not effective for outlier detection ○ MAX/MIN etc ● Other probabilistic structures like CMS, heavy hitters, etc ● SnappyData implements Hokusai ○ Capturing item frequencies in timeseries ● Design permits TopK queries over arbitrary time intervals (Top100 popular URLs) SELECT pageURL, count(*) frequency FROM Table WHERE …. GROUP BY …. ORDER BY frequency DESC LIMIT 100
  • 32. Synopses Data Engine Demo. [Diagram: Zeppelin Server with the Zeppelin Spark Interpreter (Driver), connected to three Spark Executor JVMs, each holding a row cache and columnar compressed data.]
  • 33. Free Cloud trial service – Project iSight ● Free AWS/Azure credits for folks to try out SnappyData ● One click launch of private SnappyData cluster with Zeppelin ● Multiple notebooks with comprehensive description of concepts and value ● Bring your own data sets to try ‘Instant visualization’ using Synopses data Send email to chomp@snappydata.io to be notified. Anticipate release in next 2 weeks
  • 34. Unified OLAP/OLTP streaming w/ Spark ● Far fewer resources: TB problem becomes GB. ○ CPU contention drops ● Far less complex ○ single cluster for stream ingestion, continuous queries, interactive queries and machine learning ● Much faster ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  • 35. www.snappydata.io SnappyData is Open Source ● Ad Analytics example/benchmark - https://github.com/SnappyDataInc/snappy-poc ● https://github.com/SnappyDataInc/snappydata ● Learn more www.snappydata.io/blog ● Connect: ○ twitter: www.twitter.com/snappydata ○ facebook: www.facebook.com/snappydata ○ slack: http://snappydata-slackin.herokuapp.com
  • 37. Use Case Patterns 1. Operational Analytics DB - Caching for Analytics over disparate sources - Federate query between samples and backend 2. Stream analytics for Spark: process streams, transform, real-time scoring, store, query 3. In-memory transactional store: highly concurrent apps, SQL cache, OLTP + OLAP
  • 39. Snappy Spark Cluster Deployment topologies. Unified Cluster: • Snappy store and Spark Executor share the JVM memory • Reference based access, zero copy. Split Cluster: • SnappyStore is isolated but uses the same column format as Spark for high throughput
  • 40. Simple API, Spark Compatible ● Access Table as DataFrame: catalog is automatically recovered ● Store: RDD[T]/DataFrame can be stored in SnappyData tables ● Access from remote SQL clients ● Additional API for updates, inserts, deletes. /* Save a DataFrame using the Snappy or Spark context … */ context.createExternalTable("T1", "ROW", myDataFrame.schema, props); /* save using DataFrame API */ dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1"); val impressionLogs: DataFrame = context.table(colTable) val campaignRef: DataFrame = context.table(rowTable) val parquetData: DataFrame = context.table(parquetTable) <… Now use any of DataFrame APIs … >
  • 41. Extends Spark: CREATE [Temporary] TABLE [IF NOT EXISTS] table_name ( <column definition> ) USING 'JDBC | ROW | COLUMN' OPTIONS ( COLOCATE_WITH 'table_name', /* default: none */ PARTITION_BY 'PRIMARY KEY | column name', /* a replicated table by default */ REDUNDANCY '1', /* manage HA */ PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS", /* empty string maps to the default disk store */ OFFHEAP "true | false", EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT", ….. ) [AS select_statement];
  • 42. Simple to Ingest Streams using SQL. Steps: consume from stream; transform raw data; continuous analytics; ingest into in-memory store; overflow table to HDFS. Create stream table AdImpressionLog (<Columns>) using directkafka_stream options ( <socket endpoints> "topics 'adnetwork-topic'", "rowConverter 'AdImpressionLogAvroDecoder'" ) streamingContext.registerCQ("select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques from AdImpressionLog window (duration '2' seconds, slide '2' seconds) where geo != 'unknown' group by publisher, geo") /* Register CQ */ .foreachDataFrame(df => { df.write.format("column").mode(SaveMode.Append).saveAsTable("adImpressions") })
  • 44. How do we extend Spark for Real Time? • Spark Executors are long running. Driver failure does not shut down Executors • Driver HA: Drivers run "Managed" with a standby secondary • Data HA: Consensus based clustering integrated for eager replication
  • 45. How do we extend Spark for Real Time? • Bypass the scheduler for low latency SQL • Deep integration with Spark Catalyst (SQL): colocation optimizations, index use, etc. • Full SQL support: Persistent Catalog, Transactions, DML
  • 46. AdImpression Demo Spark, SQL Code Walkthrough, interactive SQL
  • 47. Concurrent Ingest + Query Performance • AWS 4 c4.2xlarge instances - 8 cores, 15GB mem • Each node ingests the stream from Kafka in parallel • Parallel batch writes to store (32 partitions) • Only a few cores used for stream writes, as most of the CPU is reserved for OLAP queries. [Chart: stream ingestion rate, per-second throughput on 4 nodes with a cap on CPU to allow for queries: Spark-Cassandra 322,000; Spark-InMemoryDB 480,000; SnappyData 670,000.] https://github.com/SnappyDataInc/snappy-poc 2X to 45X faster (vs Cassandra, IMDB)
  • 48. Concurrent Ingest + Query Performance. Q1: sample "scan" oriented OLAP query (Spark SQL) performance, executed while ingesting data: select count(*) AS adCount, geo from adImpressions group by geo order by adCount desc limit 20; [Chart: response time (millis) at 30M / 60M / 90M: Spark-Cassandra 20346 / 65061 / 93960; Spark-InMemoryDB 3649 / 5801 / 7295; SnappyData 1056 / 1571 / 2144.] https://github.com/SnappyDataInc/snappy-poc 2X to 45X faster
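
Slides 25 to 27 contrast uniform and stratified sampling on a six-row ad impressions table. The sketch below replays that example in plain Scala, with no Spark or SnappyData dependency. It is only an illustration under stated assumptions: the names `Impression`, `Sampled`, `SamplingSketch`, `avgBid` and `estimatedCount` are hypothetical, and weighting each sampled row by the inverse of its sampling rate is a standard estimator chosen here, not necessarily what the Synopses Data Engine does internally.

```scala
import scala.util.Random

// Row layout taken from the slides' ad impressions table.
final case class Impression(id: Int, advertiser: String, geo: String, bid: Double)

// A sampled row carries the rate at which its group was sampled.
final case class Sampled(row: Impression, rate: Double)

object SamplingSketch {
  val table: Seq[Impression] = Seq(
    Impression(1, "adv10", "NY", 0.0001),
    Impression(2, "adv10", "VT", 0.0005),
    Impression(3, "adv20", "NY", 0.0002),
    Impression(4, "adv10", "NY", 0.0003),
    Impression(5, "adv20", "NY", 0.0001),
    Impression(6, "adv30", "VT", 0.0001)
  )

  // The exact samples shown on slides 25 and 27, with their stated rates.
  val uniformFromSlide: Seq[Sampled] =
    table.filter(r => Set(3, 5).contains(r.id)).map(Sampled(_, 1.0 / 3))
  val stratifiedFromSlide: Seq[Sampled] = Seq(
    Sampled(table(2), 1.0 / 4), // ID 3, NY stratum: 1 of 4 rows kept
    Sampled(table(1), 1.0 / 2)  // ID 2, VT stratum: 1 of 2 rows kept
  )

  // Uniform sampling: every row has the same chance, so small groups can vanish.
  def uniformSample(rows: Seq[Impression], n: Int, rng: Random): Seq[Sampled] = {
    val rate = n.toDouble / rows.size
    rng.shuffle(rows).take(n).map(Sampled(_, rate))
  }

  // Stratified sampling: keep up to `perStratum` rows for each distinct key,
  // so every group present in the table is also present in the sample.
  def stratifiedSample(rows: Seq[Impression], perStratum: Int, rng: Random)
                      (key: Impression => String): Seq[Sampled] =
    rows.groupBy(key).values.toSeq.flatMap { stratum =>
      val kept = rng.shuffle(stratum).take(perStratum)
      kept.map(Sampled(_, kept.size.toDouble / stratum.size))
    }

  // avg(bid) WHERE geo = ?, weighting each sampled row by 1 / rate.
  def avgBid(sample: Seq[Sampled], geo: String): Option[Double] = {
    val hits = sample.filter(_.row.geo == geo)
    if (hits.isEmpty) None
    else {
      val weightSum = hits.map(s => 1.0 / s.rate).sum
      Some(hits.map(s => s.row.bid / s.rate).sum / weightSum)
    }
  }

  // Estimated count(*) WHERE geo = ?: each sampled row stands for 1 / rate rows.
  def estimatedCount(sample: Seq[Sampled], geo: String): Double =
    sample.filter(_.row.geo == geo).map(s => 1.0 / s.rate).sum
}
```

With the slides' own samples, `avgBid(uniformFromSlide, "VT")` is `None` because neither sampled row comes from VT. The stratified sample answers the same query from its single VT row (0.0005, against an exact answer of 0.0003), and `estimatedCount` recovers 2 VT rows and 4 NY rows from the per-stratum rates. The estimate is coarse with one row per stratum; the point is only that the rare group is never lost.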

Editor's Notes

  1. CONTEXT SHOULD BE OUR MISSION …. LAMBDA LIKE WOULD BE BETTER ….
  2. optimizations to enable direct access of storage into local execution variables, avoiding all copying to bring data from storage layer to execution layer (possible only due to our unique embedded mode). Integrated with whole-stage code generation of Spark 2.0 so that these get compiled by JIT into exactly one memory load instruction for one primitive value (uncompressed).
  3. There is a reciprocal relationship with Spark RDDs/DataFrames: any table is visible as a DataFrame and vice versa. Hence, all the Spark APIs and transformations can also be applied to Snappy managed tables. For instance, you can use the DataFrame data source API to save any arbitrary DataFrame into a Snappy table, as shown in the example. One cool aspect of Spark is its ability to take an RDD of objects (say with nested structure) and implicitly infer its schema, i.e. turn it into a DataFrame and store it.
  4. The SQL dialect will be Spark SQL ++, i.e. we are extending SQL to be much more compliant with standard SQL. A number of the extensions that dictate things like HA, disk persistence, etc. are all specified through OPTIONS in Spark SQL.
  5. CREATE HDFSSTORE streamingstore NameNode 'hdfs://gfxd1:8020' HomeDir 'stream-tables' BatchSize 10 BatchTimeInterval 2000 milliseconds QueuePersistent true MaxWriteOnlyFileSize 200 WriteOnlyFileRolloverInterval 1 minute;
  6. Manage (mutable) data in Spark executors (the store memory manager works with the Block manager). Make executors long lived, which means Spark drivers run decoupled and can fail. - Managed drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration
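
Slide 11 and note 2 both turn on the difference between pulling rows through operator functions and running one fused, generated loop. The plain-Scala sketch below illustrates that difference for an average over an integer column. It is a hand-written analogy under stated assumptions, not Spark's generated code: the names `Operator`, `Scan`, `Filter` and `CodegenSketch` are hypothetical, and the `delay > 0` predicate is an arbitrary filter chosen for the example.

```scala
// Iterator-style operators: every row crosses one virtual next() call per operator.
trait Operator { def next(): Option[Int] }

final class Scan(data: Array[Int]) extends Operator {
  private var i = 0
  def next(): Option[Int] =
    if (i < data.length) { val v = data(i); i += 1; Some(v) } else None
}

final class Filter(child: Operator, pred: Int => Boolean) extends Operator {
  def next(): Option[Int] = {
    var row = child.next()
    // Skip rows that fail the predicate; stop on a match or at end of input.
    while (row.exists(v => !pred(v))) row = child.next()
    row
  }
}

object CodegenSketch {
  // Interpreted plan: the aggregate pulls rows through Filter and Scan one call at a time.
  def avgVolcano(data: Array[Int], pred: Int => Boolean): Double = {
    val plan: Operator = new Filter(new Scan(data), pred)
    var sum = 0L
    var count = 0L
    var row = plan.next()
    while (row.isDefined) { sum += row.get; count += 1; row = plan.next() }
    if (count == 0) Double.NaN else sum.toDouble / count
  }

  // Fused version: scan, filter and aggregate collapsed into one loop over a
  // primitive array, with all state held in local variables.
  def avgFused(data: Array[Int]): Double = {
    var sum = 0L
    var count = 0L
    var i = 0
    while (i < data.length) {
      val delay = data(i)
      if (delay > 0) { sum += delay; count += 1 }
      i += 1
    }
    if (count == 0) Double.NaN else sum.toDouble / count
  }
}
```

Both functions return the same average for the same input and predicate. The fused loop avoids the per-row `Option` allocation and the two levels of indirect calls, which is the "remove a layer" idea from slide 11 in miniature.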