SlideShare uma empresa Scribd logo
1 de 40
Baixar para ler offline
CLICKSTREAM ANALYSIS
WITH APACHE SPARK
Andreas	Zitzelsberger
THE CHALLENGE
ONE POT TO RULE THEM ALL
Web Tracking Ad Tracking
ERP
CRM
▪
Products
▪
Inventory
▪
Margins
▪
Customer
▪
Orders
▪
Creditworthiness
▪
Ad Impressions
▪
Ad Costs
▪
Clicks &
Views
▪
Conversions
ONE POT TO RULE THEM ALL
Retention Reach
Monetarization
steer …
▪ Campaigns
▪ Offers
▪ Contents
REACT ON WEB SITE
TRAFFIC IN REAL TIME
Image: https://www.flickr.com/photos/nick-m/3663923048
SAMPLE RESULTS
Geolocated and gender-
specific conversions.
Frequency of visits
Performance of an ad campaign
THE CONCEPTS
Image: Randy Paulino
THE FIRST SKETCH
(= real-time)
SQL
CALCULATING USER JOURNEYS
C V VT VT VT C X
C V
V V V V V V V
C V V C V V V
VT VT V V V VT C
V X
Event stream: User journeys:
Web / Ad tracking
KPIs:
▪ Unique users
▪ Conversions
▪ Ad costs / conversion value
▪ …
V
X
VT
C Click
View
View Time
Conversion
THE ARCHITECTURE
Big Data
„LARRY & FRIENDS“ ARCHITECTURE
Runs not well for more
than 1 TB data in terms of
ingestion speed, query time
and optimization efforts
Image: adweek.com
Nope.
Sorry, no Big Data.
„HADOOP & FRIENDS“ ARCHITECTURE
Aggregation
takes too long
Cumbersome
programming model
(can be solved with
pig, cascading et al.)
Not
interactive
enough
Nope.	
Too	sluggish.
Κ-ARCHITECTURE
Cumbersome
programming model
Over-engineered: We only need
15min real-time ;-)
Stateful aggregations (unique x,
conversions) require a separate DB
with high throughput and fast
aggregations & lookups.
Λ-ARCHITECTURE
Cumbersome
programming model
Complex
architecture
Redundant
logic
FEELS OVER-ENGINEERED…
http://www.brainlazy.com/article/random-nonsense/over-engineered
The Final Architecture*
*) Maybe called µ-architecture one day ;-)
FUNCTIONAL ARCHITECTURE
Strange Events
IngestionRaw Event
Stream
Collection Events Processing Analytics
Warehouse
Fact
Entries
Atomic Event
Frames
Data Lake
Master Data Integration
▪ Buffers load peeks
▪ Ensures message
delivery (fire & forget
for client)
▪ Create user journeys and
unique user sets
▪ Enrich dimensions
▪ Aggregate events to KPIs
▪ Ability to replay for schema
evolution
▪ The representation of truth
▪ Multidimensional data
model
▪ Interactive queries for
actions in realtime and
data exploration
▪ Eternal memory for all
events (even strange
ones)
▪ One schema per event
type. Time partitioned.
▪ Fault tolerant message handling
▪ Event handling: Apply schema, time-partitioning, De-dup, sanity
checks, pre-aggregation, filtering, fraud detection
▪ Tolerates delayed events
▪ High throughput, moderate latency (~ 1min)
SERIAL CONNECTION OF STREAMING AND BATCHING
IngestionRaw Event
Stream
Collection Event Data Lake Processing Analytics
Warehouse
Fact
Entries
SQL Interface
Atomic Event
Frames
▪ Cool programming model
▪ Uniform dev&ops
▪ Simple solution
▪ High compression ratio due to
column-oriented storage
▪ High scan speed
▪ Cool programming model
▪ Uniform dev&ops
▪ High performance
▪ Interface to R out-of-the-box
▪ Useful libs: MLlib, GraphX, NLP, …
▪ Good connectivity (JDBC,
ODBC, …)
▪ Interactive queries
▪ Uniform ops
▪ Can easily be replaced
due to Hive Metastore
▪ Obvious choice for
cloud-scale messaging
▪ Way the best throughput
and scalability of all
evaluated alternatives
public Map<Long, UserJourney>
sessionize(JavaRDD<AtomicEvent> events) {


return events

// Convert to a pair RDD with the userId as key

.mapToPair(e -> new Tuple2<>(e.getUserId(), e))

// Build user journeys

.<UserJourneyAcc>combineByKey(
UserJourneyAcc::create,
UserJourneyAcc::add,
UserJourneyAcc::combine)

// Convert to a Java map

.collectAsMap();

}
STREAM VERSUS BATCH
https://en.wikipedia.org/wiki/Tanker_(ship)#/media/File:Sirius_Star_2008b.jpghttps://blog.allstate.com/top-5-safety-tips-at-the-gas-pump/
APACHE FLINK
■ Also	has	a	nice,	Spark-like	API	
■ Promises	similar	or	better	
performance	than	spark	
■ Looks	like	the	best	solution	for	a	κ-
Architecture	
■ But	it’s	also	the	newest	kid	on	the	
block
EVENT VERSUS PROCESSING TIME
■ There’s	a	difference	between	even	time	(te)	and	processing	time	
(tp).	
■ Events	arrive	out-of	order	even	during	normal	operation.	
■ Events	may	arrive	arbitrary	late.
Apply	a	grace	period	before	processing	events.
Allow	arbitrary	update	windows	of	metrics.
EXAMPLE
Minute
Hour
Day
Week
Month
Quarter
Year
I
U
U
U
U
U
U
I
U
U
U
U
U
U
U
Resolution	

in	Time
Time
dtp
tp
tp:	 Processing	Time	
ti:	 Ingestion	time	
te:	 Event	Time	
dtp:	 Aggregation	time		
	 frame	
dtw:	 Grace	period	
					:	 Insert	fact	
					:	 Update	fact
dtw
te
ti
LESSONS LEARNED
Image: http://hochmeister-alpin.at
BEST-OF-BREED INSTEAD OF COMMODITY SOLUTIONS
ETL
Analytics
Realtime
Analytics
Slice &
Dice
Data
Exploration
Polyglot Processing
http://datadventures.ghost.io/2014/07/06/polyglot-processing
POLYGLOT ANALYTICS
Data Lake
Analytics
Warehouse
SQL 

lane
R

lane
Timeseries

lane
Reporting Data Exploration
Data Science
NO RETENTION PARANOIA
Data Lake
Analytics
Warehouse
▪ Eternal memory
▪ Close to raw events
▪ Allows replays and refills

into warehouse
Aggressive forgetting with clearly defined 

retention policy per aggregation level like:
▪ 15min:30d
▪ 1h:4m
▪ …
Events
Strange Events
BEWARE OF THE HIPSTERS
Image: h&m
ENSURE YOUR SOFTWARE RUNS LOCALLY
The entire architecture must be able to run locally.
Keep the round trips low for development and
testing.
Throughput and reaction times need to be monitored
continuously. Tune your software and the underlying
frameworks as needed.
TUNE CONTINUOUSLY
IngestionRaw Event
Stream
Collection Event Data Lake Processing Analytics
Warehouse
Fact
Entries
SQL Interface
Atomic Event
Frames
Load
generator Throughput & latency probes
System, container and process monitoring
IN NUMBERS
Overall dev effort until the first release: 250 person days
Dimensions: 10 KPIs: 26
Integrated 3rd party systems: 7
Inbound data volume per day: 80GB
New data in DWH per day: 2GB
Total price of cheapest cluster which is able to handle production load:
THANK YOU
@andreasz82
andreas.zitzelsberger@qaware.de
BONUS SLIDES
CALCULATING UNIQUE USERS
■ We	need	an	exact	unique	user	
count.	
■ If	you	can,	you	should	use	an	
approximation	such	as	
HyperLogLog.
U1
U2
U3
U1
U4
Time
Users
3 UU 2 UU
4 UU
Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal
cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the
Analysis of Algorithms.
CHARTING TECHNOLOGY
https://github.com/qaware/big-data-landscape
CHOOSING WHERE TO AGGREGATE
Ingestion Event Data Lake Processing Analytics
Warehouse
Fact
Entries
Analytics
Atomic Event
Frames
1 2
3
- Enrichment
- Preprocessing
- Validation
The hard lifting.
- Processing steps that can
be done at query time.
- Interactive queries.

Mais conteúdo relacionado

Mais procurados

Building a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformBuilding a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformJampp
 
Snowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back againSnowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back againAlexander Dean
 
Modelling event data in look ml
Modelling event data in look mlModelling event data in look ml
Modelling event data in look mlyalisassoon
 
Kafka as an Event Store (Guido Schmutz, Trivadis) Kafka Summit NYC 2019
Kafka as an Event Store (Guido Schmutz, Trivadis) Kafka Summit NYC 2019Kafka as an Event Store (Guido Schmutz, Trivadis) Kafka Summit NYC 2019
Kafka as an Event Store (Guido Schmutz, Trivadis) Kafka Summit NYC 2019confluent
 
Cqrs + event sourcing pyxis v2 - en
Cqrs + event sourcing   pyxis v2 - enCqrs + event sourcing   pyxis v2 - en
Cqrs + event sourcing pyxis v2 - enEric De Carufel
 
2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modelingyalisassoon
 
Big Data Expo 2015 - Anchormen Enter the Lambda-architecture
Big Data Expo 2015 - Anchormen Enter the Lambda-architectureBig Data Expo 2015 - Anchormen Enter the Lambda-architecture
Big Data Expo 2015 - Anchormen Enter the Lambda-architectureBigDataExpo
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...confluent
 
RedisConf18 - Redis Analytics Use Cases
RedisConf18 - Redis Analytics Use CasesRedisConf18 - Redis Analytics Use Cases
RedisConf18 - Redis Analytics Use CasesRedis Labs
 
Modernising Change - Lime Point - Confluent - Kong
Modernising Change - Lime Point - Confluent - KongModernising Change - Lime Point - Confluent - Kong
Modernising Change - Lime Point - Confluent - Kongconfluent
 
Keynote: Jay Kreps, Confluent | Kafka ♥ Cloud | Kafka Summit 2020
Keynote: Jay Kreps, Confluent | Kafka ♥ Cloud | Kafka Summit 2020Keynote: Jay Kreps, Confluent | Kafka ♥ Cloud | Kafka Summit 2020
Keynote: Jay Kreps, Confluent | Kafka ♥ Cloud | Kafka Summit 2020confluent
 
Hadoop Summit 2016 - Evolution of Big Data Pipelines At Intuit
Hadoop Summit 2016 - Evolution of Big Data Pipelines At IntuitHadoop Summit 2016 - Evolution of Big Data Pipelines At Intuit
Hadoop Summit 2016 - Evolution of Big Data Pipelines At IntuitRekha Joshi
 
Understanding event data
Understanding event dataUnderstanding event data
Understanding event datayalisassoon
 
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...confluent
 
Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...
Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...
Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...Flink Forward
 
Spreadsheets To API
Spreadsheets To APISpreadsheets To API
Spreadsheets To APIChris Haddad
 
Analyser vos logs avec Ingensi
Analyser vos logs avec IngensiAnalyser vos logs avec Ingensi
Analyser vos logs avec IngensiCyrès
 
Snowplow, Metail and Cascalog
Snowplow, Metail and CascalogSnowplow, Metail and Cascalog
Snowplow, Metail and CascalogRobert Boland
 
How we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changingHow we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changingyalisassoon
 
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler AnswersLambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers VoltDB
 

Mais procurados (20)

Building a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformBuilding a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platform
 
Snowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back againSnowplow Analytics: from NoSQL to SQL and back again
Snowplow Analytics: from NoSQL to SQL and back again
 
Modelling event data in look ml
Modelling event data in look mlModelling event data in look ml
Modelling event data in look ml
 
Kafka as an Event Store (Guido Schmutz, Trivadis) Kafka Summit NYC 2019
Kafka as an Event Store (Guido Schmutz, Trivadis) Kafka Summit NYC 2019Kafka as an Event Store (Guido Schmutz, Trivadis) Kafka Summit NYC 2019
Kafka as an Event Store (Guido Schmutz, Trivadis) Kafka Summit NYC 2019
 
Cqrs + event sourcing pyxis v2 - en
Cqrs + event sourcing   pyxis v2 - enCqrs + event sourcing   pyxis v2 - en
Cqrs + event sourcing pyxis v2 - en
 
2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling2016 09 measurecamp - event data modeling
2016 09 measurecamp - event data modeling
 
Big Data Expo 2015 - Anchormen Enter the Lambda-architecture
Big Data Expo 2015 - Anchormen Enter the Lambda-architectureBig Data Expo 2015 - Anchormen Enter the Lambda-architecture
Big Data Expo 2015 - Anchormen Enter the Lambda-architecture
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
 
RedisConf18 - Redis Analytics Use Cases
RedisConf18 - Redis Analytics Use CasesRedisConf18 - Redis Analytics Use Cases
RedisConf18 - Redis Analytics Use Cases
 
Modernising Change - Lime Point - Confluent - Kong
Modernising Change - Lime Point - Confluent - KongModernising Change - Lime Point - Confluent - Kong
Modernising Change - Lime Point - Confluent - Kong
 
Keynote: Jay Kreps, Confluent | Kafka ♥ Cloud | Kafka Summit 2020
Keynote: Jay Kreps, Confluent | Kafka ♥ Cloud | Kafka Summit 2020Keynote: Jay Kreps, Confluent | Kafka ♥ Cloud | Kafka Summit 2020
Keynote: Jay Kreps, Confluent | Kafka ♥ Cloud | Kafka Summit 2020
 
Hadoop Summit 2016 - Evolution of Big Data Pipelines At Intuit
Hadoop Summit 2016 - Evolution of Big Data Pipelines At IntuitHadoop Summit 2016 - Evolution of Big Data Pipelines At Intuit
Hadoop Summit 2016 - Evolution of Big Data Pipelines At Intuit
 
Understanding event data
Understanding event dataUnderstanding event data
Understanding event data
 
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
 
Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...
Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...
Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...
 
Spreadsheets To API
Spreadsheets To APISpreadsheets To API
Spreadsheets To API
 
Analyser vos logs avec Ingensi
Analyser vos logs avec IngensiAnalyser vos logs avec Ingensi
Analyser vos logs avec Ingensi
 
Snowplow, Metail and Cascalog
Snowplow, Metail and CascalogSnowplow, Metail and Cascalog
Snowplow, Metail and Cascalog
 
How we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changingHow we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changing
 
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler AnswersLambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
 

Semelhante a Clickstream Analysis With Apache Spark

Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateSrinath Perera
 
WSO2 Workshop Sydney 2016 - Analytics
WSO2 Workshop Sydney 2016 -  AnalyticsWSO2 Workshop Sydney 2016 -  Analytics
WSO2 Workshop Sydney 2016 - AnalyticsDassana Wijesekara
 
How we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the wayHow we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the wayGrega Kespret
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
 
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache KafkaScylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache KafkaScyllaDB
 
[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_DataNAVER D2
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingDatabricks
 
CQRS + Event Sourcing
CQRS + Event SourcingCQRS + Event Sourcing
CQRS + Event SourcingMike Bild
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaTreasure Data, Inc.
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsSingleStore
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Eric Sammer
 
Assessing New Databases– Translytical Use Cases
Assessing New Databases– Translytical Use CasesAssessing New Databases– Translytical Use Cases
Assessing New Databases– Translytical Use CasesDATAVERSITY
 
Processing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processProcessing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processJampp
 
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices Apigee | Google Cloud
 
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...StampedeCon
 
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...Lucas Jellema
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Felicia Haggarty
 
Rapid Prototyping for Big Data with AWS
Rapid Prototyping for Big Data with AWS Rapid Prototyping for Big Data with AWS
Rapid Prototyping for Big Data with AWS SoftServe
 

Semelhante a Clickstream Analysis With Apache Spark (20)

Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
 
WSO2 Workshop Sydney 2016 - Analytics
WSO2 Workshop Sydney 2016 -  AnalyticsWSO2 Workshop Sydney 2016 -  Analytics
WSO2 Workshop Sydney 2016 - Analytics
 
How we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the wayHow we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the way
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
 
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache KafkaScylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka
 
[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud Streaming
 
CQRS + Event Sourcing
CQRS + Event SourcingCQRS + Event Sourcing
CQRS + Event Sourcing
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015
 
Assessing New Databases– Translytical Use Cases
Assessing New Databases– Translytical Use CasesAssessing New Databases– Translytical Use Cases
Assessing New Databases– Translytical Use Cases
 
Processing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the processProcessing 19 billion messages in real time and NOT dying in the process
Processing 19 billion messages in real time and NOT dying in the process
 
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
I Love APIs 2015: Building Predictive Apps with Lamda and MicroServices
 
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
Real Time Event Processing and In-­memory analysis of Big Data - StampedeCon ...
 
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
Pragmatic CQRS with existing applications and databases (Digital Xchange, May...
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015
 
Rapid Prototyping for Big Data with AWS
Rapid Prototyping for Big Data with AWS Rapid Prototyping for Big Data with AWS
Rapid Prototyping for Big Data with AWS
 

Último

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Último (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

Clickstream Analysis With Apache Spark