SlideShare uma empresa Scribd logo
1 de 42
Baixar para ler offline
Stream All the Things
Streaming Architectures for
Data Streams that Never End
Dean Wampler, Ph.D. (@deanwampler)
Free, as in 🍺
bit.ly/fastdata-ORbook
My	book,	published	last	fall	that	describes	the	points	in	this	talk	in	greater	depth.	I’ve	refined	the	talk	a	bit	since	this	was	published.
Streaming Engines in Context…
Classic Batch Architecture:
Hadoop
Logs
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
YARN
Resource	
Manager
Node	
Manager
N
M
Batch
MapReduce
…
Spark
Flume
SqoopDBs
Schematic	view	of	Hadoop.	You	need	3	core	things…
Logs
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
YARN
Resource	
Manager
Node	
Manager
N
M
Batch
MapReduce
…
Spark
Flume
SqoopDBs
1.	Storage	tier,	either	HDFS	(Hadoop	Distributed	File	System)	or	alternatives	like	S3	or	databases.
Logs
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
YARN
Resource	
Manager
Node	
Manager
N
M
Batch
MapReduce
…
Spark
Flume
SqoopDBs
2.	You	need	a	compute	engine	for	processing	data.	First	we	had	MapReduce,	then	a	few	other	tools	to	replace	it,	finally	Spark	has	emerged	as	the	most	viable	successor.
Logs
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
YARN
Resource	
Manager
Node	
Manager
N
M
Batch
MapReduce
…
Spark
Flume
SqoopDBs
3.	Finally,	you	need	a	manager	of	resources,	scheduler	of	jobs	and	tasks,	etc.
Logs
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
YARN
Resource	
Manager
Node	
Manager
N
M
Batch
MapReduce
…
Spark
Flume
SqoopDBs
Other	tools	are	built	on	this	foundation	or	supplement	it,	like	tools	for	ingesting	data,	such	as	Sqoop	and	Flume.
New Streaming,
“Fast Data” Architecture
(but it also supports batch)
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
Numbering	is	in	the	report,	so	it’s	easier	for	the	text	to	refer	back	to	the	figure.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
1	&	2.	Data	sources	include	streams	of	data	over	sockets	and	from	logs.	You	might	also	ingest	through	REST	channels,	but	unless	they	are	async,	the	overhead	will	be	too	high,	so	REST	might	be	used	only	for	communicating	with	the	microservices	(3)	you	write	to	complete	your	environment.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
3.	A	real	fast	data	environment	is	similar	to	more	general	services,	you’ll	have	the	“heavy	hitters”	like	Spark,	Kafka,	etc.,	but	you’ll	also	need	to	write	support	microservices	to	complete	the	environment.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
4.	Some	services,	like	Kafka,	require	ZooKeeper	for	managing	shares	state,	such	as	leaders.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
5.	Kafka	is	the	core	of	this	architecture,	the	place	where	data	is	ingested	in	queues,	organized	into	topics.	Publishers	are	consumers	are	decoupled	and	N-to-M.	Kafka	has	massive	scalability	and	excellent	resiliency	and	data	durability.	All	services	can	communicate	through	each	other	using	Kafka,	too,	rather	than	having	to	manage	arbitrary	
point-to-point	connections.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
6,	8,	and	10.	Many	pure	streaming,	mini-batch	streaming,	and	batch	engines	are	vying	for	your	attention.	We’ll	focus	into	this	area	after	this	overview.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
7,	9,	and	10.	Data	can	also	be	read	and	written	between	these	compute	engines	and	storage,	not	just	Kafka.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
11.	You	can	run	this	architecture	on	Mesos,	YARN	(Hadoop),	on	premise	or	in	the	cloud.	(At	Lightbend,	we	think	Mesos	is	the	best	choice.)
Streaming Engines
Now	let’s	zero	in…
Features to Consider
When	deciding	what	to	use,	there	are	several	dimensions	or	features	to	consider.
•Low latency? How low?
•…
If	you	have	tight	latency	requirements,	you	can’t	use	a	mini-batch	or	batch	engine.	If	you’re	constraints	are	more	flexible,	you	can	do	more	sophisticated	things,	like	training	ML	models,	writing	to	databases,	etc.
•Low latency? How low?
•High volume? How high?
•…
Some	tools	are	better	at	large	volumes	than	others.	If	you	have	low	volumes,	you’ll	want	tools	that	are	efficient	at	low	volumes,	whereas	some	of	the	high-volume	tools	are	efficient	per	record	when	amortized	over	the	stream.
•Low latency? How low?
•High volume? How high?
•Integration with other tools?
•Which ones?
•…
Connecting	everything	through	Kafka	can	eliminate	this	requirement,	but	often	that’s	not	ideal	and	you	may	need	some	tools	to	have	direct	connections	to	other	tools.
•…
•Which kinds of data processing,
analytics?
•Process individual events?
•Process records in bulk?
Is	your	data	processing	essential	data	warehouse	kinds	of	analytics	or	similar	SQL	queries?	Are	you	doing	ETL	of	data?	Are	you	doing	aggregations?	Who	is	consuming	this	data?	
Are	you	doing	complex	event	processing	(CEP)	where	you	need	to	make	decisions	individually,	per	event?	Or,	is	it	a	stream	with	processing	“en	masse”,	like	typical	SQL	and	ETL	processing.	
We’ll	comment	on	how	these	characteristics	are	realized	in	the	specific	tools	we’ll	discuss	next,	but	if	you’re	using	other	tools,	consider	how	they	fit	these	dimensions.
Best of Breed
Streaming Engines
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
Ka>a Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
Focusing	in	on	the	pure	(low-latency)	streaming	and	mini	batch	engines	on	the	right…
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
Ka>a Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
• Apache Beam
• Based on Google Dataflow
• Requires a “runner”
• (Dataflow is the runner in
Google Cloud…)
• Most sophisticated streaming
semantics.
Google	has	spent	years	thinking	about	all	the	things	a	streaming	engine	has	to	handle	if	you	want	to	do	accurate	calculations	on	streams,	accounting	for	all	the	contingencies	that	can	happen.	The	latest	incarnation	is	Google	Dataflow,	available	as	a	service	in	Google	Cloud	Platform.	Google	recently	open-sourced	the	part	of	Dataflow	for	
defining	“data	flows”	with	these	sophisticated	semantics,	called	Apache	Beam.	Beam	doesn’t	provide	a	runner,	so	third-party	tools	provide	this	capability.
Here	is	an	example	of	the	scenarios	you	have	to	handle.	Suppose	you	are	computing	per-minute	aggregations	(like	sales	per	minute).	The	analysis	machine	has	one	clock	that	is	not	necessarily	in	sync	with	the	other	servers	processing	sales.	Worse,	there	is	an	unavoidable	time	delay	for	events	to	arrive	to	the	analysis	server.	Hence	you	need	to	
process	event	time	and	you	need	to	account	for	late	arriving	data,	not	only	small	delays	where	some	events/minute	will	arrive	within	the	following	minute,	but	even	large	delays	due	to	network	partitions,	servers	busy,	etc.,	etc.	This	is	one	example	of	the	challenges	posed	by	streaming	when	you	want	accurate	vs.	approximate	results.
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
Ka>a Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
• Apache Flink
• High volume
• Low latency
• Apache Beam runner
• Evolving SQL and ML support
Flink	provides	two	unique	capabilities	among	these	choices,	1)	low-latency	processing	at	scale	and	2)	it	is	the	most	mature	runner	for	Apache	Beam	data	flows	(at	least	that	I	know	of)	outside	Google’s	own	Dataflow	engine.
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
Ka>a Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
• Akka Streams
• Complex event processing
• Efficient per-event
• Not for large scale “pipes”
• Low latency
• “Might” become an Apache
Beam runner
There	is	a	lesser-known	engine	called	Gearpump	(Intel	project)	that	offers	similar	capabilities	to	Flink,	but	is	implemented	in	Akka,	so	it	could	provide	a	Beam	runner	capability	on	top	of	Akka	Streams.	Hence,	Akka	Streams	might	be	extended	to	run	Beam	flows	for	the	situation	where	event	processing	(i.e.,	CEP)	is	the	preferred	model,	as	
opposed	to	“fat”	data	pipes.	Akka	also	provides	a	powerful	Actor	model	for	rich	concurrency,	and	libraries	for	clustering,	state	management,	and	interop	with	many	sources	and	sinks	(the	“Alpakka”	project).
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
Ka>a Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
• Kafka Streams
• Low-overhead processing of
Kafka topics.
• Ideal for:
• ETL of raw data
• Last value for a key,
aggregations (“KTable”
abstraction)
Kafka	Streams	is	a	great	80%	solution	that	works	close	with	Kafka	(in	terms	of	production	deployment	and	overhead).	It	is	designed	to	read	Kafka	topics,	perform	processing	like	transformations,	filtering,	aggregations,	“last-seen”	value	for	keys,	etc.	then	write	results	back	to	Kafka.	It	nicely	addresses	many	common	design	challenges,	but	isn’t	
designed	to	be	a	complete	solution	for	stream	processing,	e.g.,	running	SQL	queries	and	training	ML	models.
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
Ka>a Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
• Spark
• Streaming: mini-batch model
• Latency > ~500 msecs
• Ideal for:
• Rich SQL
• ML model training
• Beam support? Coming…
Spark	Streaming	is	a	mini-batch	model,	although	this	is	slowly	being	replaced	with	a	true,	low-latency	streaming	model.	So,	today,	use	it	for	more	expensive	calculations,	like	training	ML	models,	but	don’t	use	it	when	the	lowest-latency	processing	is	required.	Lightbend’s	Fast	Data	Platform	is	working	on	tools	to	make	it	easy	to	train	ML	models	
in	Spark	and	serve	them	with	the	other	tools.	Another	advantage	of	Spark	is	the	rich	ecosystem	of	tools	for	a	wide	variety	of	batch	and	streaming	scenarios.
What about Microservices?
I’m	going	to	argue	that	service	architectures	(classically	three	tier,	but	evolving…)	and	data	architectures	(classically	Big	Data	like	Hadoop)	are	converging.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
• How is this…
• … like this?
• Image from:
You	can	get	this	great	report	by	Jonas	Boner	here:	lightbend.com/reactive-microservices-architecture
• A Data app | microservice:
• Has one responsibility
• …
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
Each	data	app,	streaming	or	batch,	is	typically	doing	one	thing,	like	ETL	this	Kafka	topic	to	Cassandra,	or	compute	aggregates	for	a	dashboard,	etc.	Similarly,	services	evolving	now	into	microservices	are	also	supposed	to	do	one	thing	and	communicate	with	other	services	for	the	“help”	they	need.
• A Data app | microservice:
• Has one responsibility
• Ingests and processes a never
ending stream of data |
messages
• …
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
A	streaming	data	app	will	have	a	never-ending	stream	of	data	to	process.	Similarly,	requests	for	service	will	never	stop	coming	to	microservices.
• A Data app | microservice:
• Has one responsibility
• Ingests and processes a never
ending stream of data |
messages
• Must be available, responsive,
resilient, scalable, … reactive
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
Hence,	both	kinds	of	systems	must	provide	high	availability	to	respond	to	the	stream	of	data|messages	coming	in.	That	means	that	both	must	be	resilient	against	failure,	scale	on	demand	to	meet	the	load.	These	are	the	reactive	principles.
• Thesis:
• Microservice and Data
architectures are converging.
• Similar design problems.
• Data tends to dominate as
environment grows.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
So,	both	kinds	of	systems	have	similar	design	problems,	hence	both	should	be	implemented	in	similar	ways.	Also,	organizations	that	aren’t	particularly	data	focused,	especially	when	they	are	new	endeavors,	often	find	that	data	grows	to	be	a	dominant	concern,	a	sign	of	success!	For	example,	Twitter	started	as	a	classic	three-tier	application,	
but	now	most	of	their	infrastructure	looks	like	a	petroleum	refinery.
• <Marketing_Plug />
• Lightbend Fast Data Platform
• Converged data services
and microservice tools
• Quick start, expert guidance
• Intelligent cluster
management tools.
Mesos, YARN, Cloud, …
Logs
Sockets
REST
ZooKeeper Cluster
ZK
Mini-batch
Spark	
Streaming
Batch
Spark
…
Low Latency
Flink
Ka9a	Streams
Akka	Streams
Beam
…
Persistence
S3
HDFS
DiskDiskDisk
SQL/
NoSQL
Search
1
5
6
3
11
KaEa Cluster
Ka9a
Microservices
RP Go
Node.js …
2
4
7
8
9
10
Beam
Free, as in 🍺
bit.ly/fastdata-ORbook
Once	again,	the	link	to	my	book,	published	last	fall	that	describes	the	points	in	this	talk	in	greater	depth.	I’ve	refined	the	talk	a	bit	since	this	was	published.
Interested in
Lightbend Fast Data Platform?
lightbend.com/fast-data-platform
This	URL	takes	you	to	information	about	the	product	we’re	building,	Fast	Data	Platform,	that	helps	teams	succeed	with	these	technologies,	from	initial	installation	and	configuration,	to	writing	applications,	to	runtime	production	monitoring,	to	advanced	machine-learning	based	automation.

Mais conteúdo relacionado

Mais procurados

Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseRachel Warren
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014cdmaxime
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsSatya Narayan
 
Spark For The Business Analyst
Spark For The Business AnalystSpark For The Business Analyst
Spark For The Business AnalystGustaf Cavanaugh
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Holden Karau
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Akhil Das
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesIlya Ganelin
 

Mais procurados (20)

Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use Case
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
Spark For The Business Analyst
Spark For The Business AnalystSpark For The Business Analyst
Spark For The Business Analyst
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)Spark & Spark Streaming Internals - Nov 15 (1)
Spark & Spark Streaming Internals - Nov 15 (1)
 
Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 

Semelhante a Stream all the things

In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1Adam Muise
 
Spark Overview - Oleg Mürk
Spark Overview - Oleg MürkSpark Overview - Oleg Mürk
Spark Overview - Oleg MürkPlanet OS
 
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraDUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraAndrey Kudryavtsev
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkDataStax Academy
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of HadoopNam Nham
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 

Semelhante a Stream all the things (20)

In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
HADOOP
HADOOPHADOOP
HADOOP
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
Handling not so big data
Handling not so big dataHandling not so big data
Handling not so big data
 
Spark Overview - Oleg Mürk
Spark Overview - Oleg MürkSpark Overview - Oleg Mürk
Spark Overview - Oleg Mürk
 
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraDUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Data Science
Data ScienceData Science
Data Science
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 

Último

RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 

Último (20)

RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 

Stream all the things