SlideShare uma empresa Scribd logo
1 de 88
Baixar para ler offline
Building a Versatile Analytics Pipeline
On Top Of Apache Spark
Misha Chernetsov, Grammarly
Spark Summit 2017
June 6, 2017
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail Chernetsov
Data Team Lead @ Grammarly
Building Analytics Pipelines (5 years)
Coding on JVM (12 years), Scala + Spark (3 years)
About Me: Misha Chernetsov
@chernetsov
Tool that helps us better understand:
● Who are our users?
● How do they interact with the product?
● How do they get in, engage, pay, and how long do they stay?
Analytics @ Consumer Product Company
We want our decisions to be
data-driven
Everyone: product managers, marketing, engineers, support...
Analytics @ Consumer Product Company
data
analytics
report
Analytics @ Consumer Product Company
Calendar Day
Number of
unique active
users by day
Example Report 1 – Daily Active Users
dummy data!dummy data!
Example Report 2 – Comparison of Cohort Retention Over Time
dummy data!dummy data!
Ads
Email
Social
Number of
users who
bought a
subscription.
Split by traffic
source type
(where user
came from)
Calendar Day
Example Report 3 – Payer Conversions By Traffic Source
dummy data!dummy data!
● Landing page visit
○ URL with UTM tags
○ Referrer
● Subscription purchased
○ Is first in subscription
Example: Data
Everything is an Event
Example: Data
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
…
}
{
"eventName": "subscribe",
"period": "12 months",
…
}
Enrich and/or Join
Example: Data
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
…
}
{
"eventName": "subscribe",
"period": "12 months",
…
}
Slice by Plot
capture enrich index query
Analytics @ Consumer Product Company
capture enrich index query
Use 3rd Party?
1. Integrated Event Analytics
2. UI over your DB
Reports are not tailored for your
needs, limited capability.
Pre-aggregation / enriching
is still on you.
Hard to achieve accuracy and trust.
capture enrich index query
Build Step 1: Capture
● Always up, resilient
● Spikes / back pressure
● Buffer for delayed processing
Kafka
Capture
REST
{
"eventName": "page-visit",
"url": "...?utm_medium=paid",
…
}
Long-term
Storage
StreamKafka
Save To Long-Term Storage
Cassandra
micro-batch
Kafka
Save To Long-Term Storage
val rdd = KafkaUtils.createRDD[K, V](...)
rdd.saveToCassandra("raw")
capture enrich index query
Build Step 2: Enrich
Enrichment 1: User Attribution
Enrichment 1: User Attribution
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
"fingerprint": "abc",
…
}
{
"eventName": "subscribe",
"userId": 123,
"fingerprint": "abc",
…
}
Enrichment 1: User Attribution
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
"fingerprint": "abc",
…
}
{
"eventName": "subscribe",
"userId": 123,
"fingerprint": "abc",
…
}
"attributedUserId": 123,
t
Non-authenticated:
userId = null
Authenticated:
userId = 123
fingerprint = abc
(All events from a
given browser)
Enrichment 1: User Attribution
t
Non-authenticated:
userId = null
attributedUserId = 123
Authenticated:
userId = 123
fingerprint = abc
(All events from a
given browser)
Enrichment 1: User Attribution
Authenticated:
userId = 756
Enrichment 1: User Attribution
tfingerprint = abc
(All events from a
given browser)
Authenticated:
userId = 123
Heuristics to
attribute those
Authenticated:
userId = 756
Enrichment 1: User Attribution
tfingerprint = abc
(All events from a
given browser)
Authenticated:
userId = 123
Heuristics to
attribute those
rdd.mapPartitions { iterator =>
val buffer = new ArrayBuffer()
iterator
.takeWhile(_.userId.isEmpty)
.foreach(buffer.append)
val userId = iterator.head.userId
buffer.map(_.setAttributedUserId(userId)) ++ iterator
}
Enrichment 1: User Attribution
rdd.mapPartitions { iterator =>
val buffer = new ArrayBuffer()
iterator
.takeWhile(_.userId.isEmpty)
.foreach(buffer.append)
val userId = iterator.head.userId
buffer.map(_.setAttributedUserId(userId)) ++ iterator
}
Enrichment 1: User Attribution
rdd.mapPartitions { iterator =>
val buffer = new ArrayBuffer()
iterator
.takeWhile(_.userId.isEmpty)
.foreach(buffer.append)
val userId = iterator.head.userId
buffer.map(_.setAttributedUserId(userId)) ++ iterator
}
Enrichment 1: User Attribution
rdd.mapPartitions { iterator =>
val buffer = new ArrayBuffer()
iterator
.takeWhile(_.userId.isEmpty)
.foreach(buffer.append)
val userId = iterator.head.userId
buffer.map(_.setAttributedUserId(userId)) ++ iterator
}
Enrichment 1: User Attribution
rdd.mapPartitions { iterator =>
val buffer = new ArrayBuffer()
iterator
.takeWhile(_.userId.isEmpty)
.foreach(buffer.append)
val userId = iterator.head.userId
buffer.map(_.setAttributedUserId(userId)) ++ iterator
}
Enrichment 1: User Attribution
Can grow big and
OOM your worker for
outliers who use
Grammarly without
ever registering
By default we
should operate
in User Memory
(small fraction).
Spark & Memory
User Memory
100% - spark.memory.fraction = 25%
Spark Memory
spark.memory.fraction = 75%
Let’s get into
Spark Memory
and use its
safety features.
rdd.mapPartitions { iterator =>
val buffer = new SpillableBuffer()
iterator
.takeWhile(_.userId.isEmpty)
.foreach(buffer.append)
val userId = iterator.head.userId
buffer.map(_.setAttributedUserId(userId)) ++ iterator
}
Enrichment 1: User Attribution
Can safely grow in
mem while enough
free Spark Mem. Spills
to disk otherwise.
Spark Memory Manager & Spillable Collection
Memory Disk
Spark Memory Manager & Spillable Collection
Memory Disk
Spark Memory Manager & Spillable Collection
Memory Disk
Spark Memory Manager & Spillable Collection
Memory
×2
Disk
Spark Memory Manager & Spillable Collection
Memory Disk
Spark Memory Manager & Spillable Collection
Memory Disk
Spark Memory Manager & Spillable Collection
Memory
Spill to Disk
Disk
Spark Memory Manager & Spillable Collection
Memory Disk
Spill to Disk
Spark Memory Manager & Spillable Collection
Memory Disk
trait SizeTracker {
def afterUpdate(): Unit = { … }
def estimateSize(): Long = { … }
}
Call on every append.
Periodically estimates size
and saves samples.
Extrapolates
SizeTracker
trait Spillable {
abstract def spill(inMemCollection: C): Unit
def maybeSpill(currentMemory: Long, inMemCollection: C) {
try x2 if needed
}
}
Spillable
call on every append to collection
public long acquireExecutionMemory(long required, …)
public void releaseExecutionMemory(long size, …)
TaskMemoryManager
● Be safe with outliers
● Get outside User Memory (25%), use Spark Memory (75%)
● Spark APIs: Could be a bit friendlier and high level
Custom Spillable Collection
Enrichment 2: Calculable Props
Enrichment Phase 2: Calculable Props
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
"fingerprint": "abc",
"attributedUserId": 123,
…
}
{
"eventName": "subscribe",
"userId": 123,
"fingerprint": "abc",
…
}
Enrichment Phase 2: Calculable Props
{
"eventName": "page-visit",
"url": "...?utm_medium=ad",
"fingerprint": "abc",
"attributedUserId": 123,
…
}
{
"eventName": "subscribe",
"userId": 123,
"fingerprint": "abc",
"firstUtmMedium": "ad",
…
}
val firstUtmMedium: CalcProp[String] =
(E  "url").as[Url]
.map(_.param("utm_source"))
.forEvent("page-visit")
.first
Enrichment Phase 2: Calculable Props Engine & DSL
● Type-safe, functional, composable
● Familiar: similar to Scala collections API
● Batch & Stream (incremental)
Enrichment Phase 2: Calculable Props Engine & DSL
Enrichment Pipeline with Spark
Raw
Kafka
Spark Pipeline
Stream:
Save Raw Kafka
Stream:
User Attr.
User-attributed
Kafka
Stream:
Calc Props
Enriched and
Queryable
Batch:
User Attr.
Batch:
Calc Props
batch
micro-batchmicro-batchmicro-batch
Cassandra
Kafka
Spark Pipeline
Kafka
Cassandra
Kafka
Parquet on
AWS S3
batch
● Connectors for everything
● Great for batch
○ Shuffle with spilling
○ Failure recovery
● Great for streaming
○ Fast
○ Low overhead
Spark Pipeline
batch
micro-batchmicro-batchmicro-batch
Cassandra
Kafka
Spark Pipeline
Kafka
Cassandra
Kafka
Parquet on
AWS S3
batch
job
Multiple Output Destinations
Kafka Kafka
CassandraCassandra
val rdd: RDD[T]
rdd.sendToKafka(“topic_x”)
rdd.saveToCassandra(“table_foo”)
rdd.saveToCassandra(“table_bar”)
Multiple Output Destinations: Try 1
rdd.saveToCassandra(...)
rdd.forEachPartition(...)
sc.runJob(...)
Multiple Output Destinations: Try 1
job
Multiple Output Destinations: Try 1
Kafka Kafka
CassandraCassandra
job
Multiple Output Destinations: Try 1 = 3 Jobs
Kafka Kafka
job
Kafka
Table 1
job
Kafka
Table 2
val rdd: RDD[T]
rdd.cache()
rdd.sendToKafka(“topic_x”)
rdd.saveToCassandra(“table_foo”)
rdd.saveToCassandra(“table_bar”)
Multiple Output Destinations: Try 2
job
Multiple Output Destinations: Try 2 = Read Once, 3 Jobs
Kafka Kafka
job
Cache
Table 1
job
Cache
Table 2
rdd.forEachPartition { iterator =>
val writer = new BufferedWriter(
new OutputStreamWriter(new FileOutStream())
)
iterator.forEach { el =>
writer.writeln(el)
}
writer.close() // makes sure this writes
}
Writer
rdd.forEachPartition { iterator =>
val writer = new BufferedWriter(
new OutputStreamWriter(new FileOutStream())
)
iterator.forEach { el =>
writer.writeln(el)
}
writer.close() // makes sure this writes
}
Writer
rdd.forEachPartition { iterator =>
val writer = new BufferedWriter(
new OutputStreamWriter(new FileOutStream())
)
iterator.forEach { el =>
writer.writeln(el)
}
writer.close() // makes sure this writes
}
Writer
rdd.forEachPartition { iterator =>
val writer = new BufferedWriter(
new OutputStreamWriter(new FileOutStream(...))
)
iterator.forEach { el =>
writer.writeln(el)
}
writer.close() // makes sure this writes
}
● Buffer
● Non-blocking
● Idempotent / Dedupe
Writer
andWriteToX = rdd.mapPartitions { iterator =>
val writer = new XWriter()
val writingIterator = iterator.map { el =>
writer.write(el)
}.closing(() => writer.close)
}
AndWriter
andWriteToX = rdd.mapPartitions { iterator =>
val writer = new XWriter()
val writingIterator = iterator.map { el =>
writer.write(el)
}.closing(() => writer.close)
}
AndWriter
andWriteToX = rdd.mapPartitions { iterator =>
val writer = new XWriter()
val writingIterator = iterator.map { el =>
writer.write(el)
}.closing(() => writer.close)
}
AndWriter
val rdd: RDD[T]
rdd.andSaveToCassandra(“table_foo”)
.andSaveToCassandra(“table_bar”)
.sendToKafka(“topic_x”)
Multiple Output Destinations: Try 3
job
Multiple Output Destinations: Try 3
Kafka Kafka
CassandraCassandra
● Kafka
● Cassandra
● HDFS
Important! Each andWriter will consume resources
● Memory (buffers)
● IO
And Writer
capture enrich index query
Build Step 3: Index
Index
● Parquet on AWS S3
● Custom partitioning: By eventName and time interval
● Append changes, compact + merge on the fly when querying
● Randomized names to maximize S3 parallelism
● Use s3a for max performance and and tweak for S3 read-after-write
consistency
● Support flexible schema, even with conflicts!
Some Stats
● Thousands of
events
per second
● Terabytes of
compressed
data
capture enrich index query
Build Step 4: Query
● DataFrames
● Spark SQL Scala dsl / Pure SQL
● Zeppelin
Hardcore Query
● Plot by day
● Unique visitors
● Filter by country
● Split by traffic source (top 20)
● Time from 2 weeks ago to today
Casual Query
Option 1: SQL
Quickly gets complex
Too expensive
to build, extend
and support
Option 2: UI
SEGMENT “eventName”
WHERE foo = “bar” AND x.y IN (“a”, “b”, “c”)
UNIQUE
BY m IS NOT NULL
TIME from 2 months ago to today
STEP 1 month SPAN 1 week
Option 3: Custom Query Language
SEGMENT “eventName”
WHERE foo = “bar” AND x.y IN (“a”, “b”, “c”)
UNIQUE
BY m IS NOT NULL
TIME from 2 months ago to today
STEP 1 month SPAN 1 week
Option 3: Custom Query Language
Expressions
● Segment, Funnel, Retention
● UI & as DataFrame in Zeppelin
● Spark <= 1.6 – Scala Parser Combinators
● Reuse most complex part of expression parser
● Relatively extensible
Option 3: Custom Query Language
Option 3: Custom Query Language
● Custom versatile analytics is doable and enjoyable
● Spark is a great platform to build analytics on top of
○ Enrichment Pipeline: Batch / Streaming, Query, ML
● Would be cool to see even deep internals slightly more extensible
Conclusion
We are hiring!
olivia@grammarly.com
https://www.grammarly.com/jobs
Thank you!
Questions?

Mais conteúdo relacionado

Mais procurados

Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...HostedbyConfluent
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...Databricks
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions Yugabyte
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into OverdriveTodd Palino
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!Vitor Oliveira
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Till Rohrmann
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Databricks
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com confluent
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsTimothy Spann
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 

Mais procurados (20)

Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
Intelligent Auto-scaling of Kafka Consumers with Workload Prediction | Ming S...
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration Options
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 

Semelhante a Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail Chernetsov

PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.Natalino Busa
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developersChristopher Batey
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationSean Chittenden
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafkaSamuel Kerrien
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVMVoxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVMManuel Bernhardt
 
Awesome Banking API's
Awesome Banking API'sAwesome Banking API's
Awesome Banking API'sNatalino Busa
 

Semelhante a Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail Chernetsov (20)

PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.Hadoop + Cassandra: Fast queries on data lakes, and  wikipedia search tutorial.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Spark Meetup
Spark MeetupSpark Meetup
Spark Meetup
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern Automation
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVMVoxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
Voxxed Days Vienna - The Why and How of Reactive Web-Applications on the JVM
 
Awesome Banking API's
Awesome Banking API'sAwesome Banking API's
Awesome Banking API's
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Último

Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Cyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis ProjectCyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis Projectdanielbell861
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Create Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopCreate Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopThinkInnovation
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 

Último (13)

Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Cyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis ProjectCyclistic Memberships Data Analysis Project
Cyclistic Memberships Data Analysis Project
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Create Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI DesktopCreate Data Model & Conduct Visualisation in Power BI Desktop
Create Data Model & Conduct Visualisation in Power BI Desktop
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 

Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail Chernetsov