SlideShare uma empresa Scribd logo
1 de 79
Baixar para ler offline
Fast and Simplified Streaming,
Ad-Hoc and Batch Analytics with
FiloDB and Spark Streaming
andEvan Chan Helena Edelson
March 2016
Evan Chan
Distinguished Engineer,
User and contributor to Spark since 0.9, Cassandra since
0.6
Co-creator and maintainer of
Tuplejump
@evanfchan
http://velvia.github.io
Spark Job Server
Helena Edelson
VP of Product Engineering,
Committer: Kafka Connect Cassandra, Spark Cassandra
Connector
Contributor to Akka (2 new features in Akka Cluster), Spring
Integration
Speaker @ Spark Summit, Kafka Summit, Strata, QCon, Scala
Days, Scala World, Philly ETE
Tuplejump
@helenaedelson
http://github.com/helena
Tuplejump
is a big data technology leader providing solutions and
development partnership.
Tuplejump
Open Source: on GitHubTuplejump
- Subject of today's talk
- Kafka-Cassandra Source and
Sink
- The rst Spark Cassandra integration
- Lucene indexer for Cassandra
- HDFS for Cassandra
FiloDB
Kafka Connect Cassandra
Calliope
Stargate
SnackFS
Tuplejump Consulting & Development
Tuplejump Blender
Buildsuni eddatasetsforfastquerying,streamingandbatchsources,MLandAnalytics
Topics
Modern streaming architectures and batch/ad-hoc
architectures
Precise and scalable streaming ingestion using Kafka, Akka,
Spark Streaming, Cassandra, and FiloDB
How a uni ed streaming + batch stack can lower your TCO
What FiloDB is and how it enables fast analytics with
competitive storage cost
Data Warehousing with Spark, Cassandra, and FiloDB
Time series / event data / geospatial examples
Machine learning using Spark MLLib + Cassandra/FiloDB
The Problem Domain
Build scalable, adaptable, self-healing, distributed data
processing systems for
Massive amounts of data
Disparate sources and schemas
Differing data structures
Complex analytics and learning tasks
To run as large-scale clustered
data ows
24-7 Uptime
Globally clustered deployments
No data loss
Delivering Meaning
Deliver meaning in sec/sub-sec latency
Billions of events per second
Low latency real time stream processing
Higher latency batch processing
Aggregation of global data from the
stream
While We Monitor, Predict & Proactively
Handle
Massive event spikes & bursty traf c
Fast producers / slow consumers
Network partitioning & out of sync
systems
DC down
Not DDOS'ing ourselves from fast streams
No data loss when auto-scaling down
And stay within our
AWS
OpenStack
Rackspace...
budget
Use Case
I need fast access to historical data on the y for predictive
modeling with real time data from the stream
Only, It's Not A Stream It's A Flood
Net ix
100 billion events per day
1-2 million events per second at
peak
Linkedin
500 billion write events per day
2.5 trillion read events per day
4.5 million events per second at peak with
Kafka
Petabytes of streaming data
Con dential
700 TB global ingested data (pre-
ltered)
Lambda Architecture
A data-processing architecture designed to handle massive quantities
of data by taking advantage of both batch and stream processing
methods.
Lambda Architecture
( )https://www.mapr.com/developercentral/lambda-architecture
Challenge Assumptions
"Ingest an immutable sequence of records is captured and fed into
a batch processing system
and a stream processing system
in parallel"
Lambda Architecture
Dual Analytics Systems
Many moving parts: KV store, real time, Batch technologies
Running similar code and reconciling queries in dual
systems
Overly Complicated
Pipelines
Ops: e.g. performance tuning & monitoring, upgrades...
Analytics logic changes: dev/testing/deploys of changes to
code bases, clusters
Ultimately High TCO
And...
Technical Debt
Are Batch and Streaming Systems
Fundamentally Different?
No.
A Unified Streaming Architecture
Everything On The Streaming Platform
Scala / Spark
Streaming
Mesos
Akka
Cassandra
Kafka
SNACK (SMACK) Stack
Apache Spark Streaming
One runtime for streaming and batch processing
Join streaming and static data sets
No code duplication
Easy Kafka stream integration
Easy to reconcile queries against multiple
sources
Easy integration of KV durable storage
High Throughput Distributed Messaging
Decouples Data Pipelines
Handles Massive Data Load
Support Massive Number of Consumers
Distribution & partitioning across cluster
nodes
Automatic recovery from broker failures
Apache Cassandra
Horizontally scalable
Multi-Region / Multi-Datacenter
Always On - Survive regional outages
Extremely fast writes: - perfect for ingestion of real time /
machine data
Very exible data modelling (lists, sets, custom data types)
Easy to operate
Best of breed storage technology, huge community
BUT: Simple queries only
OLTP-oriented/center
High performance concurrency framework for Scala and
Java
Fault Tolerance
Asynchronous messaging and data processing
Parallelization
Location Transparency
Local / Remote Routing
Akka: Cluster / Persistence / Streams
Enables
Streaming and Batch In One System
Streaming ML and Analytics for Predictions In The Stream
Show me the code
Immutable Raw Data From Kafka Stream
Replay / reprocessing: for fault tolerance, logic changes..
classKafkaStreamingActor(ssc:StreamingContext,settings:Settings)extendsAggregationAc
valstream=KafkaUtils.createDirectStream(...)
.map(RawWeatherData(_))
stream
.foreachRDD(_.toDF.write.format("filodb.spark"))
.option(rawDataKeyspace,rawDataTable)
/*Pre-Aggregatedatainthestreamforfastqueryingandaggregationlater.*/
stream.map(hour=>
(hour.wsid,hour.year,hour.month,hour.day,hour.oneHourPrecip)
).saveToCassandra(CassandraKeyspace,CassandraTableDailyPrecip)

}
Reading Data Back From Cassandra
Compute isolation in Akka Actor
classTemperatureActor(sc:SparkContext,settings:Settings)extendsAggregationActor
importsettings._
importakka.pattern.pipe
defreceive:Actor.Receive={

casee:GetMonthlyHiLowTemperature=>highLow(e,sender)

}


defhighLow(e:GetMonthlyHiLowTemperature,requester:ActorRef):Unit=

sc.cassandraTable[DailyTemperature](timeseriesKeyspace,dailyTempAggregTable)

.where("wsid=?ANDyear=?ANDmonth=?",e.wsid,e.year,e.month)
.collectAsync()

.map(MonthlyTemperature(_,e.wsid,e.year,e.month))pipeTorequester
}
Spark Streaming, MLLib
Kafka, Cassandra
valssc=newStreamingContext(sparkConf,Seconds(5)

valtestData=ssc.cassandraTable[String](keyspace,table)
.map(LabeledPoint.parse)
valtrainingStream=KafkaUtils.createDirectStream[_,_,_,_](..)
.map(transformFunc)
.map(LabeledPoint.parse)
trainingStream.saveToCassandra("ml_training_keyspace","raw_training_data")


valmodel=newStreamingLinearRegressionWithSGD()

.setInitialWeights(Vectors.dense(weights))

.trainOn(trainingStream)
model
.predictOnValues(testData.map(lp=>(lp.label,lp.features)))
.saveToCassandra("ml_predictions_keyspace","predictions")
Tuplejump OSS Roadmap
gearpump-external-
lodb
Kafka Connect FiloDB
Gearpump
What's Missing? One Pipeline For Fast +
Big Data
Using Cassandra for Batch Analytics /
Event Storage / ML?
Storage ef ciency and scan speeds for reading large volumes
of data (for complex analytics, ML) become important
concerns
Regular Cassandra CQL tables are not very good at either
storage ef ciency or scan speeds
A different, analytics-optimized solution is needed...
All hard work leads to pro t, but mere talk leads
to poverty.
- Proverbs 14:23
Introducing FiloDB
A distributed, versioned, columnar analytics database.
Built for Streaming.
 
github.com/tuplejump/FiloDB
Fast Analytics Storage
Scan speeds competitive with Apache Parquet
Up to 200x faster scan speeds than with Cassandra 2.x
Flexible ltering along two dimensions
Much more ef cient and exible partition key ltering
Ef cient columnar storage, up to 27x more ef cient than
Cassandra 2.x
Comparing Storage Costs and Query Speeds
https://www.oreilly.com/ideas/apache-cassandra-for-analytics-
a-performance-and-storage-analysis
Robust Distributed Storage
Apache Cassandra as the rock-solid storage engine. Scale out
with no SPOF. Cross-datacenter replication. Proven storage and
database technology.
 
Cassandra-Like Data Model
Column A Column B
Partition
key 1
Segment
1
Segment
2
Segment
1
Segment
2
Partition
key 2
Segment
1
Segment
2
Segment
1
Segment
2
partition keys - distributes data around a cluster, and allows
for ne grained and exible ltering
segment keys - do range scans within a partition, e.g. by time
slice
primary key based ingestion and updates
Flexible Filtering
Unlike Cassandra, FiloDB offers very exible and ef cient
ltering on partition keys. Partial key matches, fast IN queries on
any part of the partition key.
No need to write multiple tables to work around answering different
queries.
Spark SQL Queries!
CREATETABLEgdeltUSINGfilodb.sparkOPTIONS(dataset"gdelt");
SELECTActor1Name,Actor2Name,AvgToneFROMgdeltORDERBYAvgToneDESCLIMIT15
INSERTINTOgdeltSELECT*FROMNewMonthData;
Read to and write from Spark Dataframes
Append/merge to FiloDB table from Spark
Streaming
Use Tableau or any other JDBC tool
What's in the name?
Rich sweet layers of distributed, versioned database goodness
SNACK (SMACK) stack for all your
Analytics
Regular Cassandra tables for highly concurrent, aggregate /
key-value lookups (dashboards)
FiloDB + C* + Spark for ef cient long term event storage
Ad hoc / SQL / BI
Data source for MLLib / building models
Data storage for classi ed / predicted / scored data
Being Productionized as we speak...
One enterprise with many TB of nancial and reporting data is
moving their data warehouse to FiloDB + Cassandra + Spark
Another startup uses FiloDB as event storage, feeds the events
into Spark MLlib, scores incoming data, then stores the results
back in FiloDB for low-latency use cases
From their CTO: “I see close to MemSQL / Vertica or even
better” “More cost effective than Redshift”
Data Warehousing with FiloDB
Scenarios
BI Reporting, concurrency + seconds latency
Ad-hoc queries
Needing to do JOINs with fact tables + dimension
tables
Slowly changing dim tables / hard to denormalize
Need to work with legacy BI tools
Real-world DW Architecture Stack
Ef cient columnar storage + ltering = low latency BI
Modeling Fact Tables for FiloDB
Single partition queries are really fast and take up only one
thread
Given the following two partition key columns:
entity_number, year_month
WHERE entity_number = '0453' AND
year_month = '2014 December'
Exact match for partition key is pushed down as one
partition
Consider the partition key carefully
Cassandra often requires multiple tables
What about the queries that do not translate to one partition?
Cassandra has many restrictions on partition key ltering (as of
2.x).
Table 1: partition key = (entity_number, year_month)
Can push down: WHERE entity_number = NN AND
year_month IN ('2014 Jan', '2014 Feb')as
well as equals
Table 2: partition key = (year_month, entity_number)
Can push down: WHERE year_month = YYMM AND
entity_number IN (123, 456)as well as equals
IN clause must be the last column to be pushed down. Two tables
are needed just for ef cient IN queries on either entity_number
or year_month.
FiloDB Flexible Partition Filters = WIN
With ONE table, FiloDB offers FAST, arbitrary partition key
ltering. All of the below are pushed down:
WHERE year_month IN ('2014 Jan', '2014 Feb')
(all entities)
WHERE entity_number = 146(all year months)
Any combo of =, IN
Space savings: 27 *2 = 54x
Multi-Table JOINs with just Cassandra
Sub-second Multi-Table JOINs with FiloDB
Sub-second Multi-Table JOINs with FiloDB
Four tables, all of them single-partition queries
Two tables were switched from regular Cassandra tables to
FiloDB tables. 40-60 columns each, ~60k items in partition.
Scan times went down from 5-6 seconds to < 250ms
For more details, please see this .Planet Cassandra blog post
Scalable Time-Series / Event Storage with
FiloDB
Designed for Streaming
New rows appended via Spark Streaming or Kafka
Writes are idempotent - easy exactly once ingestion
Converted to columnar chunks on ingest and stored in
C*
FiloDB keeps your data sorted as it is being ingested
Spark Streaming -> FiloDB
valratingsStream=KafkaUtils.createDirectStream[String,String,StringDecoder,Strin
ratingsStream.foreachRDD{
(message:RDD[(String,String)],batchTime:Time)=>{
valdf=message.map(_._2.split(",")).map(rating=>Rating(rating(0).trim.toInt,r
toDF("fromuserid","touserid","rating")
//addthebatchtimetotheDataFrame
valdfWithBatchTime=df.withColumn("batch_time",org.apache.spark.sql.functions.l
//savetheDataFrametoFiloDB
dfWithBatchTime.write.format("filodb.spark")
.option("dataset","ratings")
.save()
}
}
One-line change to write to FiloDB vs Cassandra
Modeling example: NYC Taxi Dataset
The public contains telemetry (pickup, dropoff
locations, times) info on millions of taxi rides in NYC.
NYC Taxi Dataset
Medallion pre x 1/1 - 1/6 1/7 - 1/12
AA records records
AB records records
Partition key - :stringPrefix medallion 2- hash
multiple drivers trips into ~300 partitions
Segment key - :timeslice pickup_datetime 6d
Row key - hack_license, pickup_datetime
Allows for easy ltering by individual drivers, and slicing by time.
DEMO TIME
New York City Taxi Data Demo (Spark Notebook)
To follow along:
https://github.com/tuplejump/FiloDB/blob/master/doc/FiloDB_Taxi_G
Fast, Updatable In-Memory
Columnar Storage
Unlike RDDs and DataFrames, FiloDB can ingest new data, and
still be fast
Unlike RDDs, FiloDB can lter in multiple ways, no need for
entire table scan
FAIR scheduler + sub-second latencies => web speed queries
700 Queries Per Second in Apache Spark!
Even for datasets with 15 million rows!
Using FiloDB's InMemoryColumnStore, single host / MBP,
5GB RAM
SQL to DataFrame caching
For more details, see .this blog post
Machine Learning with Spark, Cassandra,
and FiloDB
Building a static model of NYC Taxi Trips
Predict time to get to destination based on pickup point, time
of day, other vars
Need to read all data (full table scan)
Dynamic models are better than static
models
Everything changes!
Continuously re ne model based on recent streaming data +
historical data + existing model
valssc=newStreamingContext(sparkConf,Seconds(5)
)
valdataStream=KafkaUtils.createDirectStream[..](..)
.map(transformFunc)
.map(LabeledPoint.parse)
dataStream.foreachRDD(_.toDF.write.format("filodb.spark")
.option("dataset","training").save())
if(trainNow){

varmodel=newStreamingLinearRegressionWithSGD()

.setInitialWeights(Vectors.dense(weights))

.trainOn(dataStream.join(historicalEvents))
}
model.predictOnValues(dataStream.map(lp=>(lp.label,lp.features)))
.insertIntoFilo("predictions")
The FiloDB Advantage for ML
Able to update dynamic models based on massive data
ow/updates
Integrate historical and recent events to build models
More data -> better models!
Can store scored raw data / predictions back in FiloDB
for fast user queries
FiloDB - Roadmap
Your input is appreciated!
Productionization and automated stress testing
Kafka input API / connector (without needing Spark)
In-memory caching for signi cant query speedup
True columnar querying and execution, using late
materialization and vectorization techniques. GPU/SIMD.
Projections. Often-repeated queries can be sped up
signi cantly with projections.
Thanks For Attending!
@helenaedelson
@evanfchan
@tuplejump
EXTRA SLIDES
What are my storage needs?
Non-persistent / in-memory: concurrent
viewers
Short term: latest trends
Longer term: raw event and aggregate storage
ML Models, predictions, scored data
Spark RDDs
Immutable, cache in memory and/or on
disk
Spark Streaming: UpdateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery
Using Cassandra for Short Term Storage
1020s 1010s 1000s
Bus A Speed, GPS
Bus B
Bus C
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of single bus for a range of
time
Can also query most recent location + speed of all buses
(slower)
FiloDB - How?
Multiple ways to Accelerate Queries
Columnar projection - read fewer columns, saves I/O
Partition key ltering - read less data
Sort key / PK ltering - read from subset of keys
Possible because FiloDB keeps data sorted
Versioning - write to multiple versions, read from the one you
choose
Cassandra CQL vs Columnar Layout
Cassandra stores CQL tables row-major, each row spans multiple
cells:
PartitionKey 01: rst 01:last 01:age 02: rst 02:last 02:age
Sales Bob Jones 34 Susan O'Connor 40
Engineering Dilbert P ? Dogbert Dog 1
 
Columnar layouts are column-major:
PartitionKey rst last age
Sales Bob, Susan Jones,
O'Connor
34,
40
Engineering Dilbert,
Dogbert
P, Dog ?, 1
FiloDB Cassandra Schema
CREATETABLEfilodb.gdelt_chunks(
partitiontext,
versionint,
columnnametext,
segmentidblob,
chunkidint,
datablob,
PRIMARYKEY((partition,version),columnname,segmentid,chunkid)
)WITHCLUSTERINGORDERBY(columnnameASC,segmentidASC,chunkidASC)
FiloDB Architecture
ColumnStore API - currently Cassandra and InMemory, you can
implement other backends - ElasticSearch? etc.

Mais conteúdo relacionado

Mais procurados

FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleHelena Edelson
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationPatrick Di Loreto
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016 Hiromitsu Komatsu
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackAnirvan Chakraborty
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1Joe Stein
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 

Mais procurados (20)

FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
 
Streaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For ScaleStreaming Big Data & Analytics For Scale
Streaming Big Data & Analytics For Scale
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time PersonalizationUsing Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 

Semelhante a Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming

TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkDataStax Academy
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comCedric Vidal
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikHostedbyConfluent
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungSSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungDatabricks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingDibyendu Bhattacharya
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
Getting Started with Spark Streaming
Getting Started with Spark StreamingGetting Started with Spark Streaming
Getting Started with Spark StreamingAlex Apollonsky
 

Semelhante a Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming (20)

TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.com
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungSSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Getting Started with Spark Streaming
Getting Started with Spark StreamingGetting Started with Spark Streaming
Getting Started with Spark Streaming
 

Último

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Último (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming

  • 1. Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming andEvan Chan Helena Edelson March 2016
  • 2. Evan Chan Distinguished Engineer, User and contributor to Spark since 0.9, Cassandra since 0.6 Co-creator and maintainer of Tuplejump @evanfchan http://velvia.github.io Spark Job Server
  • 3. Helena Edelson VP of Product Engineering, Committer: Kafka Connect Cassandra, Spark Cassandra Connector Contributor to Akka (2 new features in Akka Cluster), Spring Integration Speaker @ Spark Summit, Kafka Summit, Strata, QCon, Scala Days, Scala World, Philly ETE Tuplejump @helenaedelson http://github.com/helena
  • 4. Tuplejump is a big data technology leader providing solutions and development partnership. Tuplejump
  • 5. Open Source: on GitHubTuplejump - Subject of today's talk - Kafka-Cassandra Source and Sink - The rst Spark Cassandra integration - Lucene indexer for Cassandra - HDFS for Cassandra FiloDB Kafka Connect Cassandra Calliope Stargate SnackFS
  • 8. Topics Modern streaming architectures and batch/ad-hoc architectures Precise and scalable streaming ingestion using Kafka, Akka, Spark Streaming, Cassandra, and FiloDB How a uni ed streaming + batch stack can lower your TCO What FiloDB is and how it enables fast analytics with competitive storage cost Data Warehousing with Spark, Cassandra, and FiloDB Time series / event data / geospatial examples Machine learning using Spark MLLib + Cassandra/FiloDB
  • 9. The Problem Domain Build scalable, adaptable, self-healing, distributed data processing systems for Massive amounts of data Disparate sources and schemas Differing data structures Complex analytics and learning tasks To run as large-scale clustered data ows 24-7 Uptime Globally clustered deployments No data loss
  • 10. Delivering Meaning Deliver meaning in sec/sub-sec latency Billions of events per second Low latency real time stream processing Higher latency batch processing Aggregation of global data from the stream
  • 11. While We Monitor, Predict & Proactively Handle Massive event spikes & bursty traf c Fast producers / slow consumers Network partitioning & out of sync systems DC down Not DDOS'ing ourselves from fast streams No data loss when auto-scaling down
  • 12. And stay within our AWS OpenStack Rackspace... budget
  • 13. Use Case I need fast access to historical data on the y for predictive modeling with real time data from the stream
  • 14. Only, It's Not A Stream It's A Flood Net ix 100 billion events per day 1-2 million events per second at peak Linkedin 500 billion write events per day 2.5 trillion read events per day 4.5 million events per second at peak with Kafka Petabytes of streaming data Con dential 700 TB global ingested data (pre- ltered)
  • 15. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.
  • 17. Challenge Assumptions "Ingest an immutable sequence of records is captured and fed into a batch processing system and a stream processing system in parallel"
  • 18. Lambda Architecture Dual Analytics Systems Many moving parts: KV store, real time, Batch technologies Running similar code and reconciling queries in dual systems Overly Complicated Pipelines Ops: e.g. performance tuning & monitoring, upgrades... Analytics logic changes: dev/testing/deploys of changes to code bases, clusters
  • 21. Are Batch and Streaming Systems Fundamentally Different? No.
  • 22. A Unified Streaming Architecture Everything On The Streaming Platform Scala / Spark Streaming Mesos Akka Cassandra Kafka
  • 24. Apache Spark Streaming One runtime for streaming and batch processing Join streaming and static data sets No code duplication Easy Kafka stream integration Easy to reconcile queries against multiple sources Easy integration of KV durable storage
  • 25. High Throughput Distributed Messaging Decouples Data Pipelines Handles Massive Data Load Support Massive Number of Consumers Distribution & partitioning across cluster nodes Automatic recovery from broker failures
  • 26. Apache Cassandra Horizontally scalable Multi-Region / Multi-Datacenter Always On - Survive regional outages Extremely fast writes: - perfect for ingestion of real time / machine data Very exible data modelling (lists, sets, custom data types) Easy to operate Best of breed storage technology, huge community BUT: Simple queries only OLTP-oriented/center
  • 27. High performance concurrency framework for Scala and Java Fault Tolerance Asynchronous messaging and data processing Parallelization Location Transparency Local / Remote Routing Akka: Cluster / Persistence / Streams
  • 28. Enables Streaming and Batch In One System Streaming ML and Analytics for Predictions In The Stream Show me the code
  • 29. Immutable Raw Data From Kafka Stream Replay / reprocessing: for fault tolerance, logic changes.. classKafkaStreamingActor(ssc:StreamingContext,settings:Settings)extendsAggregationAc valstream=KafkaUtils.createDirectStream(...)
.map(RawWeatherData(_)) stream .foreachRDD(_.toDF.write.format("filodb.spark")) .option(rawDataKeyspace,rawDataTable) /*Pre-Aggregatedatainthestreamforfastqueryingandaggregationlater.*/ stream.map(hour=> (hour.wsid,hour.year,hour.month,hour.day,hour.oneHourPrecip) ).saveToCassandra(CassandraKeyspace,CassandraTableDailyPrecip)
 }
  • 30. Reading Data Back From Cassandra Compute isolation in Akka Actor classTemperatureActor(sc:SparkContext,settings:Settings)extendsAggregationActor importsettings._ importakka.pattern.pipe defreceive:Actor.Receive={
 casee:GetMonthlyHiLowTemperature=>highLow(e,sender)
 }

 defhighLow(e:GetMonthlyHiLowTemperature,requester:ActorRef):Unit=
 sc.cassandraTable[DailyTemperature](timeseriesKeyspace,dailyTempAggregTable)
 .where("wsid=?ANDyear=?ANDmonth=?",e.wsid,e.year,e.month) .collectAsync()
 .map(MonthlyTemperature(_,e.wsid,e.year,e.month))pipeTorequester }
  • 31. Spark Streaming, MLLib Kafka, Cassandra valssc=newStreamingContext(sparkConf,Seconds(5)
 valtestData=ssc.cassandraTable[String](keyspace,table) .map(LabeledPoint.parse) valtrainingStream=KafkaUtils.createDirectStream[_,_,_,_](..) .map(transformFunc) .map(LabeledPoint.parse) trainingStream.saveToCassandra("ml_training_keyspace","raw_training_data")
 
valmodel=newStreamingLinearRegressionWithSGD()
 .setInitialWeights(Vectors.dense(weights))
 .trainOn(trainingStream) model .predictOnValues(testData.map(lp=>(lp.label,lp.features))) .saveToCassandra("ml_predictions_keyspace","predictions")
  • 33. What's Missing? One Pipeline For Fast + Big Data
  • 34. Using Cassandra for Batch Analytics / Event Storage / ML? Storage ef ciency and scan speeds for reading large volumes of data (for complex analytics, ML) become important concerns Regular Cassandra CQL tables are not very good at either storage ef ciency or scan speeds A different, analytics-optimized solution is needed...
  • 35. All hard work leads to pro t, but mere talk leads to poverty. - Proverbs 14:23
  • 36. Introducing FiloDB A distributed, versioned, columnar analytics database. Built for Streaming.   github.com/tuplejump/FiloDB
  • 37. Fast Analytics Storage Scan speeds competitive with Apache Parquet Up to 200x faster scan speeds than with Cassandra 2.x Flexible ltering along two dimensions Much more ef cient and exible partition key ltering Ef cient columnar storage, up to 27x more ef cient than Cassandra 2.x
  • 38. Comparing Storage Costs and Query Speeds https://www.oreilly.com/ideas/apache-cassandra-for-analytics- a-performance-and-storage-analysis
  • 39. Robust Distributed Storage Apache Cassandra as the rock-solid storage engine. Scale out with no SPOF. Cross-datacenter replication. Proven storage and database technology.
  • 40.   Cassandra-Like Data Model Column A Column B Partition key 1 Segment 1 Segment 2 Segment 1 Segment 2 Partition key 2 Segment 1 Segment 2 Segment 1 Segment 2 partition keys - distributes data around a cluster, and allows for ne grained and exible ltering segment keys - do range scans within a partition, e.g. by time slice primary key based ingestion and updates
  • 41. Flexible Filtering Unlike Cassandra, FiloDB offers very exible and ef cient ltering on partition keys. Partial key matches, fast IN queries on any part of the partition key. No need to write multiple tables to work around answering different queries.
  • 43. What's in the name? Rich sweet layers of distributed, versioned database goodness
  • 44. SNACK (SMACK) stack for all your Analytics Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards) FiloDB + C* + Spark for ef cient long term event storage Ad hoc / SQL / BI Data source for MLLib / building models Data storage for classi ed / predicted / scored data
  • 45.
  • 46. Being Productionized as we speak... One enterprise with many TB of nancial and reporting data is moving their data warehouse to FiloDB + Cassandra + Spark Another startup uses FiloDB as event storage, feeds the events into Spark MLlib, scores incoming data, then stores the results back in FiloDB for low-latency use cases From their CTO: “I see close to MemSQL / Vertica or even better” “More cost effective than Redshift”
  • 48. Scenarios BI Reporting, concurrency + seconds latency Ad-hoc queries Needing to do JOINs with fact tables + dimension tables Slowly changing dim tables / hard to denormalize Need to work with legacy BI tools
  • 49. Real-world DW Architecture Stack Ef cient columnar storage + ltering = low latency BI
  • 50. Modeling Fact Tables for FiloDB Single partition queries are really fast and take up only one thread Given the following two partition key columns: entity_number, year_month WHERE entity_number = '0453' AND year_month = '2014 December' Exact match for partition key is pushed down as one partition Consider the partition key carefully
  • 51. Cassandra often requires multiple tables What about the queries that do not translate to one partition? Cassandra has many restrictions on partition key ltering (as of 2.x). Table 1: partition key = (entity_number, year_month) Can push down: WHERE entity_number = NN AND year_month IN ('2014 Jan', '2014 Feb')as well as equals Table 2: partition key = (year_month, entity_number) Can push down: WHERE year_month = YYMM AND entity_number IN (123, 456)as well as equals IN clause must be the last column to be pushed down. Two tables are needed just for ef cient IN queries on either entity_number or year_month.
  • 52. FiloDB Flexible Partition Filters = WIN With ONE table, FiloDB offers FAST, arbitrary partition key ltering. All of the below are pushed down: WHERE year_month IN ('2014 Jan', '2014 Feb') (all entities) WHERE entity_number = 146(all year months) Any combo of =, IN Space savings: 27 *2 = 54x
  • 53. Multi-Table JOINs with just Cassandra
  • 55. Sub-second Multi-Table JOINs with FiloDB Four tables, all of them single-partition queries Two tables were switched from regular Cassandra tables to FiloDB tables. 40-60 columns each, ~60k items in partition. Scan times went down from 5-6 seconds to < 250ms For more details, please see this .Planet Cassandra blog post
  • 56. Scalable Time-Series / Event Storage with FiloDB
  • 57. Designed for Streaming New rows appended via Spark Streaming or Kafka Writes are idempotent - easy exactly once ingestion Converted to columnar chunks on ingest and stored in C* FiloDB keeps your data sorted as it is being ingested
  • 58. Spark Streaming -> FiloDB valratingsStream=KafkaUtils.createDirectStream[String,String,StringDecoder,Strin ratingsStream.foreachRDD{ (message:RDD[(String,String)],batchTime:Time)=>{ valdf=message.map(_._2.split(",")).map(rating=>Rating(rating(0).trim.toInt,r toDF("fromuserid","touserid","rating") //addthebatchtimetotheDataFrame valdfWithBatchTime=df.withColumn("batch_time",org.apache.spark.sql.functions.l //savetheDataFrametoFiloDB dfWithBatchTime.write.format("filodb.spark") .option("dataset","ratings") .save() } } One-line change to write to FiloDB vs Cassandra
  • 59. Modeling example: NYC Taxi Dataset The public contains telemetry (pickup, dropoff locations, times) info on millions of taxi rides in NYC. NYC Taxi Dataset Medallion pre x 1/1 - 1/6 1/7 - 1/12 AA records records AB records records Partition key - :stringPrefix medallion 2- hash multiple drivers trips into ~300 partitions Segment key - :timeslice pickup_datetime 6d Row key - hack_license, pickup_datetime Allows for easy ltering by individual drivers, and slicing by time.
  • 60. DEMO TIME New York City Taxi Data Demo (Spark Notebook) To follow along: https://github.com/tuplejump/FiloDB/blob/master/doc/FiloDB_Taxi_G
  • 61. Fast, Updatable In-Memory Columnar Storage Unlike RDDs and DataFrames, FiloDB can ingest new data, and still be fast Unlike RDDs, FiloDB can lter in multiple ways, no need for entire table scan FAIR scheduler + sub-second latencies => web speed queries
  • 62. 700 Queries Per Second in Apache Spark! Even for datasets with 15 million rows! Using FiloDB's InMemoryColumnStore, single host / MBP, 5GB RAM SQL to DataFrame caching For more details, see .this blog post
  • 63. Machine Learning with Spark, Cassandra, and FiloDB
  • 64. Building a static model of NYC Taxi Trips Predict time to get to destination based on pickup point, time of day, other vars Need to read all data (full table scan)
  • 65. Dynamic models are better than static models Everything changes! Continuously re ne model based on recent streaming data + historical data + existing model
  • 67.
  • 68. The FiloDB Advantage for ML Able to update dynamic models based on massive data ow/updates Integrate historical and recent events to build models More data -> better models! Can store scored raw data / predictions back in FiloDB for fast user queries
  • 69. FiloDB - Roadmap Your input is appreciated! Productionization and automated stress testing Kafka input API / connector (without needing Spark) In-memory caching for signi cant query speedup True columnar querying and execution, using late materialization and vectorization techniques. GPU/SIMD. Projections. Often-repeated queries can be sped up signi cantly with projections.
  • 72. What are my storage needs? Non-persistent / in-memory: concurrent viewers Short term: latest trends Longer term: raw event and aggregate storage ML Models, predictions, scored data
  • 73. Spark RDDs Immutable, cache in memory and/or on disk Spark Streaming: UpdateStateByKey IndexedRDD - can update bits of data Snapshotting for recovery
  • 74. Using Cassandra for Short Term Storage 1020s 1010s 1000s Bus A Speed, GPS Bus B Bus C Primary key = (Bus UUID, timestamp) Easy queries: location and speed of single bus for a range of time Can also query most recent location + speed of all buses (slower)
  • 76. Multiple ways to Accelerate Queries Columnar projection - read fewer columns, saves I/O Partition key ltering - read less data Sort key / PK ltering - read from subset of keys Possible because FiloDB keeps data sorted Versioning - write to multiple versions, read from the one you choose
  • 77. Cassandra CQL vs Columnar Layout Cassandra stores CQL tables row-major, each row spans multiple cells: PartitionKey 01: rst 01:last 01:age 02: rst 02:last 02:age Sales Bob Jones 34 Susan O'Connor 40 Engineering Dilbert P ? Dogbert Dog 1   Columnar layouts are column-major: PartitionKey rst last age Sales Bob, Susan Jones, O'Connor 34, 40 Engineering Dilbert, Dogbert P, Dog ?, 1
  • 79. FiloDB Architecture ColumnStore API - currently Cassandra and InMemory, you can implement other backends - ElasticSearch? etc.