Building robust data pipelines in
Scala: the Snowplow experience
Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow,
the open-source event analytics platform
based here in London [1]
• Weekend writer of Unified Log Processing,
available on the Manning Early Access Program
[2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
Snowplow – what is it?
Snowplow is an open source event analytics platform
[Diagram: 1a. Trackers and 1b. Webhooks -> 2. Collectors -> 3. Enrich -> 4. Storage -> 5. Analytics, with standardised data protocols (labelled A to D) between stages]
• Your granular, event-level and customer-level
data, in your own data warehouse
• Connect any analytics tool to your data
• Join your event data with any other data set
Today almost all users/customers are running a batch-based
Snowplow configuration
[Diagram components: Snowplow event tracking SDK; HTTP-based event collector; Amazon S3; Hadoop-based enrichment; Amazon Redshift]
• Batch-based
• Normally run overnight;
sometimes every 4-6 hours
We also have a real-time pipeline for Snowplow in beta, built on
Amazon Kinesis (Apache Kafka support coming next year)
[Diagram components: real-time pipeline]
• Snowplow Trackers
• scala-stream-collector
• scala-kinesis-enrich
• Streams: raw event stream, bad raw event stream, enriched event stream
• Kinesis apps: S3 sink, Redshift sink, kinesis-elasticsearch-sink, event aggregator
• Stores: S3, Redshift, Elasticsearch, DynamoDB
• Legend: some components are marked as not yet released
• Analytics on Read for agile exploration of events, machine learning, auditing, re-processing…
• Analytics on Write for operational reporting, real-time dashboards, audience segmentation, personalization…
Snowplow and Scala
Today, Snowplow is primarily developed in Scala
Data modelling scripts
Ruby
• Used for Snowplow orchestration
• No event-level processing occurs in Ruby
Scala
• Used for event validation, enrichment and other processing
• Increasingly used for event storage
• Starting to be used for event collection too
Our initial skunkworks version of Snowplow had no Scala 
[Diagram: Snowplow data pipeline v1. Components: website / webapp with JavaScript event tracker; CloudFront-based pixel collector; HiveQL + Java UDF “ETL”; Amazon S3]
But our schema-first, loosely coupled approach made it possible
to start swapping out existing components…
[Diagram: Snowplow data pipeline v2. Components: website / webapp with JavaScript event tracker; CloudFront-based event collector or Clojure-based event collector; HiveQL + Java UDF “ETL”; Scalding-based enrichment; Amazon S3; Amazon Redshift / PostgreSQL]
What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building
data processing pipelines on Hadoop:
[Diagram: technology stack. Hadoop DFS; Hadoop MapReduce; Cascading, Hive and Pig; APIs over Cascading: Java, Scalding, Cascalog, PyCascading, cascading.jruby]
We chose Cascading because we liked their “plumbing”
abstraction over vanilla MapReduce
Why did we choose Scalding instead of one of the other
Cascading DSLs/APIs?
• Lots of internal experience with Scala – could hit the
ground running (only very basic awareness of Clojure
when we started the project)
• Scalding created and supported by Twitter, who use it
throughout their organization – so we knew it was a
safe long-term bet
• More controversial opinion (although maybe not at a
Scala conference): we believe that data pipelines
should be as strongly typed as possible – all the other
DSLs/APIs on top of Cascading encourage dynamic
typing
Robust data pipelines
Robust data pipelines means strongly typed data pipelines –
why?
• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an
unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
Robust data processing is a state of mind: failures will happen,
don’t panic, but don’t sweep them under the carpet either
• Our basic processing model for Snowplow looks like this:
• Looks familiar? stdin, stdout, stderr
[Diagram: Raw events -> Snowplow enrichment process -> “Good” enriched events, plus “Bad” raw events + reasons why they are bad]
This pattern is extremely composable, especially with Kinesis or
Kafka streams/topics as the core building block
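A minimal sketch of the stdin / stdout / stderr processing model, using only the Scala standard library. The types RawEvent, EnrichedEvent and BadRow and the tab-separated input format are assumptions made for illustration; they are not Snowplow's actual event models.

```scala
object ProcessingModelSketch {
  // Hypothetical types; the real Snowplow event models are far richer.
  final case class RawEvent(line: String)
  final case class EnrichedEvent(userId: String, page: String)
  final case class BadRow(raw: RawEvent, reasons: List[String])

  // One stage: every raw event ends up on exactly one of the two outputs.
  def enrich(raw: RawEvent): Either[BadRow, EnrichedEvent] =
    raw.line.split('\t') match {
      case Array(userId, page) => Right(EnrichedEvent(userId, page))
      case fields =>
        Left(BadRow(raw, List(s"Expected 2 tab-separated fields, got ${fields.length}")))
    }

  // "stdin" in, ("stderr", "stdout") out.
  def run(input: Seq[RawEvent]): (Seq[BadRow], Seq[EnrichedEvent]) = {
    val results = input.map(enrich)
    val bad  = results.collect { case Left(b) => b }
    val good = results.collect { case Right(g) => g }
    (bad, good)
  }
}
```

Because both outputs are plain collections of typed values, either one could feed another stage of the same shape, which is the composability the slide refers to.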
Validation, the “gateway
drug” to Scalaz
Inside and across our components, we use the Validation
applicative functor from the Scalaz project extensively
• Scalaz Validation lets us perform a variety of different event validations and
enrichments, and then compose (i.e. collate) the failures
• This is really powerful!
• The Scalaz codebase calls |@| a “DSL for constructing
Applicative expressions” – I think of it as “the Scream operator”
• Individual components of the enrichment process can themselves collate their
own internal failures
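To illustrate what collating failures means, here is a minimal sketch that does not use Scalaz at all. It models a validation as an Either whose Left side is a List of failure Strings, and the collate function is a hand-rolled stand-in for what |@| does for two Scalaz Validations. The Enriched type, the field checks and all names are assumptions for illustration.

```scala
object CollateSketch {
  // Stand-in for a Scalaz Validation: Left holds every failure message.
  type Validated[A] = Either[List[String], A]

  // Hypothetical enriched event; field names are illustrative only.
  final case class Enriched(ip: String, timestamp: Long)

  def validateIp(raw: String): Validated[String] =
    if (raw.split('.').length == 4) Right(raw)
    else Left(List(s"Invalid IP address: $raw"))

  def validateTimestamp(raw: String): Validated[Long] =
    try Right(raw.toLong)
    catch { case _: NumberFormatException => Left(List(s"Invalid timestamp: $raw")) }

  // Applicative-style combination: both checks have already run, and all
  // failures are kept. A flatMap chain would stop at the first Left instead.
  def collate[A, B, C](va: Validated[A], vb: Validated[B])(f: (A, B) => C): Validated[C] =
    (va, vb) match {
      case (Right(a), Right(b)) => Right(f(a, b))
      case (Left(e1), Left(e2)) => Left(e1 ++ e2)
      case (Left(e1), _)        => Left(e1)
      case (_, Left(e2))        => Left(e2)
    }

  def enrich(ip: String, ts: String): Validated[Enriched] =
    collate(validateIp(ip), validateTimestamp(ts))((i, t) => Enriched(i, t))
}
```

Calling CollateSketch.enrich with two bad inputs returns a Left containing both messages rather than only the first.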
There is a great F# article by Scott Wlaschin which describes this
approach as “railway-oriented programming” [1]
The Happy Path
• If everything succeeds, then this path outputs an enriched event
• Any individual failure along the path could switch us onto the
failure path
• We never get back onto the happy path once we leave it
The Failure Path
• Any failure can take us onto the failure path
• We can choose whether to switch straight to the
failure path (“fail fast”), or collate failures from
multiple independent tests
[1] http://fsharpforfunandprofit.com/posts/recipe-part2/
Putting it all together, the Snowplow enrichment process boils
down to one big type transformation
• Types abstracting over simpler types
• No mutable state
• Railway-oriented programming
• Collate failures inside a processing stage, fail fast between processing stages
• Using Scott Wlaschin’s “fruit as cargo” metaphor:
• Currently Snowplow uses a Non-Empty List of Strings to collect our failures:
• We are working on a ProcessingMessage case class, to capture much richer and
more structured failures than we can using Strings
The only limitation is that the Failure Path restricts us to a single
type
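A rough sketch of "collate failures inside a processing stage, fail fast between processing stages", again with the standard library only. A plain List of Strings stands in for the Non-Empty List of Strings (non-empty by convention here, not by type), and both stages and their key=value input format are hypothetical.

```scala
object StagesSketch {
  // Single failure type shared by every stage on the failure path.
  type Failures = List[String]

  // Stage 1 (hypothetical): parse "key=value" pairs, collating every malformed pair.
  def parse(raw: String): Either[Failures, Map[String, String]] = {
    val pairs = raw.split('&').toList.map { kv =>
      kv.split('=') match {
        case Array(k, v) => Right(k -> v)
        case _           => Left(s"Malformed pair: $kv")
      }
    }
    val errors = pairs.collect { case Left(e) => e }
    if (errors.nonEmpty) Left(errors)
    else Right(pairs.collect { case Right(p) => p }.toMap)
  }

  // Stage 2 (hypothetical): require two fields, collating every missing one.
  def requireFields(fields: Map[String, String]): Either[Failures, (String, String)] = {
    val missing = List("uid", "url").filterNot(fields.contains).map(k => s"Missing field: $k")
    if (missing.nonEmpty) Left(missing) else Right((fields("uid"), fields("url")))
  }

  // Fail fast between stages: flatMap never runs stage 2 if stage 1 failed.
  def process(raw: String): Either[Failures, (String, String)] =
    parse(raw).flatMap(requireFields)
}
```

Every stage shares the same Left type, which is the single-type restriction on the failure path in practice.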
A brief aside on testing
On the testing side: we love Specs2 data tables…
• They let us test a variety of inputs and expected outputs without making the
mistake of just duplicating the data processing functionality in the test:
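The idea behind a data table can be approximated in plain Scala, without Specs2 and without its table syntax: each row pairs an input with a hand-written expected output, so the test never re-implements the logic under test. The host function and the rows below are hypothetical examples.

```scala
object DataTableSketch {
  // Function under test (hypothetical): extract the host from a page URL.
  def host(url: String): Option[String] =
    scala.util.Try(new java.net.URI(url)).toOption.flatMap(u => Option(u.getHost))

  // Each row: input -> expected output, written out by hand.
  val table: List[(String, Option[String])] = List(
    "http://snowplowanalytics.com/blog"    -> Some("snowplowanalytics.com"),
    "https://github.com/snowplow/snowplow" -> Some("github.com"),
    "not a url"                            -> None
  )

  // One message per failing row; empty when every row passes.
  def failures: List[String] = table.collect {
    case (input, expected) if host(input) != expected =>
      s"$input: expected $expected, got ${host(input)}"
  }
}
```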
… and are starting to do more with ScalaCheck
• ScalaCheck is a property-based testing framework, originally inspired by
Haskell’s QuickCheck
• We use it in a few places –
including to generate
unpredictable bad data and
also to validate our new Thrift
schema for raw Snowplow
events:
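Property-based testing can be imitated with a seeded random generator from the standard library. This is not ScalaCheck and has none of its generators or shrinking; it only shows the shape of a property check against unpredictable bad data. The parseTimestamp function is a hypothetical example.

```scala
object PropertySketch {
  import scala.util.{Random, Try}

  // Function under test (hypothetical): failures become values, not exceptions.
  def parseTimestamp(raw: String): Either[String, Long] =
    try Right(raw.trim.toLong)
    catch { case _: NumberFormatException => Left(s"Invalid timestamp: $raw") }

  // Properties checked for each random sample:
  //  1. arbitrary junk never makes the parser throw
  //  2. any Long rendered as a String parses back to the same value
  def holds(samples: Int, seed: Long): Boolean = {
    val rng = new Random(seed)
    (1 to samples).forall { _ =>
      val junk = rng.nextString(rng.nextInt(20))
      val n    = rng.nextLong()
      Try(parseTimestamp(junk)).isSuccess && parseTimestamp(n.toString) == Right(n)
    }
  }
}
```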
Robustness in the face of
user-defined types
Snowplow is evolving from a fixed-schema platform to a
platform supporting user-defined JSONs
• Where other analytics tools depend on schema-less JSONs or custom variables,
we use JSON Schema
• Snowplow users send in events as “self-describing JSONs” which have to include
the schema URI which validates the event’s JSON body:
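As an illustration, a self-describing JSON carries a schema field holding an Iglu schema URI of the form iglu:vendor/name/format/version, alongside a data field. The sketch below parses such a URI with a regular expression; SchemaRef is a hypothetical holder type (not the Iglu client's own class), and the com.acme vendor and link_click payload are made up.

```scala
object SelfDescribingSketch {
  // Hypothetical holder for the four parts of an Iglu schema URI.
  final case class SchemaRef(vendor: String, name: String, format: String, version: String)

  // iglu:<vendor>/<name>/<format>/<MODEL-REVISION-ADDITION>
  private val IgluUri = """iglu:([^/]+)/([^/]+)/([^/]+)/(\d+-\d+-\d+)""".r

  def parseSchemaUri(uri: String): Either[String, SchemaRef] = uri match {
    case IgluUri(vendor, name, format, version) =>
      Right(SchemaRef(vendor, name, format, version))
    case _ =>
      Left(s"Not a valid Iglu schema URI: $uri")
  }

  // Example self-describing JSON (illustrative payload only).
  val example: String =
    """{
      |  "schema": "iglu:com.acme/link_click/jsonschema/1-0-0",
      |  "data": { "targetUrl": "http://example.com" }
      |}""".stripMargin
}
```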
To support JSON Schema, we have open-sourced Iglu, a new
schema repository system in Scala/Spray/Swagger/Jackson
Our Scala client library for Iglu lets us work with JSONs in a safe
way from within Snowplow
• If a JSON passes its JSON Schema validation, we should be able to deserialize it
and work with it safely in Scala in a strongly-typed way:
• We use json4s with the Jackson bindings, as JSON Schema support in Java/Scala
is Jackson-based
• We still wrap our JSON deserialization in Scalaz Validations in case of any
mismatch between the Scala deserialization code and the JSON schema
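A sketch of why deserialization is still wrapped in a validation. To stay self-contained it avoids json4s and Scalaz: a Map stands in for an already-parsed JSON object, and an Either with a List of Strings stands in for a Scalaz Validation. The LinkClick type and its fields are hypothetical.

```scala
object SafeDeserializeSketch {
  // Hypothetical strongly typed target for a validated link-click JSON.
  final case class LinkClick(targetUrl: String, elementId: Option[String])

  // `fields` stands in for a parsed JSON object that already passed schema validation.
  def toLinkClick(fields: Map[String, Any]): Either[List[String], LinkClick] = {
    val target: Either[List[String], String] = fields.get("targetUrl") match {
      case Some(s: String) => Right(s)
      case Some(other)     => Left(List(s"targetUrl: expected a string, got $other"))
      case None            => Left(List("targetUrl: missing"))
    }
    val element: Either[List[String], Option[String]] = fields.get("elementId") match {
      case Some(s: String) => Right(Some(s))
      case Some(other)     => Left(List(s"elementId: expected a string, got $other"))
      case None            => Right(None) // optional field
    }
    // Collate both field failures instead of throwing on the first mismatch.
    (target, element) match {
      case (Right(t), Right(e)) => Right(LinkClick(t, e))
      case (Left(e1), Left(e2)) => Left(e1 ++ e2)
      case (Left(e1), _)        => Left(e1)
      case (_, Left(e2))        => Left(e2)
    }
  }
}
```

If the Scala code and the JSON Schema ever drift apart, the mismatch surfaces as a Left on the failure path instead of a runtime exception.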
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To meet up or chat, @alexcrdean on Twitter or
alex@snowplowanalytics.com
Discount code: ulogprugcf (43% off
Unified Log Processing eBook)

Original SlideShare title: Scala eXchange: Building robust data pipelines in Scala

Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Último (20)

Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 

Scala eXchange: Building robust data pipelines in Scala

  • 1. Building robust data pipelines in Scala: the Snowplow experience
  • 2. Introducing myself • Alex Dean • Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1] • Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2] [1] https://github.com/snowplow/snowplow [2] http://manning.com/dean
  • 4. Snowplow is an open source event analytics platform. Pipeline stages: 1a. Trackers and 1b. Webhooks, 2. Collectors, 3. Enrich, 4. Storage, 5. Analytics, connected by standardised data protocols (A to D) • Your granular, event-level and customer-level data, in your own data warehouse • Connect any analytics tool to your data • Join your event data with any other data set
  • 5. Today almost all users/customers are running a batch-based Snowplow configuration. Components: Snowplow event tracking SDK, HTTP-based event collector, Amazon S3, Hadoop-based enrichment, Amazon Redshift • Batch-based • Normally run overnight; sometimes every 4-6 hours
  • 6. We also have a real-time pipeline for Snowplow in beta, built on Amazon Kinesis (Apache Kafka support coming next year). Components: Snowplow Trackers, scala-stream-collector, scala-kinesis-enrich, S3 sink Kinesis app, Redshift sink Kinesis app, kinesis-elasticsearch-sink, event aggregator Kinesis app; stores: S3, Redshift, Elasticsearch, DynamoDB; streams: raw event stream, bad raw event stream, enriched event stream (some components are marked on the slide as not yet released) • Analytics on Read for agile exploration of events, machine learning, auditing, re-processing… • Analytics on Write for operational reporting, real-time dashboards, audience segmentation, personalization…
  • 8. Today, Snowplow is primarily developed in Scala • Ruby: used for Snowplow orchestration; no event-level processing occurs in Ruby • Scala: used for event validation, enrichment and other processing; increasingly used for event storage; starting to be used for event collection too • Data modelling scripts (diagram label)
  • 9. Our initial skunkworks version of Snowplow had no Scala. Snowplow data pipeline v1: website / webapp, JavaScript event tracker, CloudFront-based pixel collector, Amazon S3, HiveQL + Java UDF “ETL”
  • 10. But our schema-first, loosely coupled approach made it possible to start swapping out existing components… Snowplow data pipeline v2: website / webapp, JavaScript event tracker, CloudFront-based or Clojure-based event collector, Amazon S3, Scalding-based enrichment, HiveQL + Java UDF “ETL”, Amazon Redshift / PostgreSQL
  • 11. What is Scalding? • Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop. Stack shown on the slide: Hadoop DFS and Hadoop MapReduce at the base; Cascading, Hive, Pig and Java on top of MapReduce; Scalding, Cascalog, PyCascading and cascading.jruby on top of Cascading
  • 12. We chose Cascading because we liked their “plumbing” abstraction over vanilla MapReduce
  • 13. Why did we choose Scalding instead of one of the other Cascading DSLs/APIs? • Lots of internal experience with Scala – could hit the ground running (only very basic awareness of Clojure when we started the project) • Scalding created and supported by Twitter, who use it throughout their organization – so we knew it was a safe long-term bet • More controversial opinion (although maybe not at a Scala conference): we believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing
  • 15. Robust data pipelines means strongly typed data pipelines – why? • Catch errors as soon as possible – and report them in a strongly typed way too • Define the inputs and outputs of each of your data processing steps in an unambiguous way • Forces you to formally address the data types flowing through your system • Lets you write code like this:
  • 16. Robust data processing is a state of mind: failures will happen, don’t panic, but don’t sweep them under the carpet either • Our basic processing model for Snowplow looks like this: raw events go into the Snowplow enrichment process, which emits “good” enriched events on one side and “bad” raw events, plus the reasons why they are bad, on the other • Looks familiar? stdin, stdout, stderr
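The good/bad split described on this slide can be sketched with the Scala standard library alone (Scala 2.13 or later assumed for `partitionMap`). The names `EnrichedEvent`, `BadRow`, `enrich` and `process`, and the tab-separated input format, are hypothetical and only serve the illustration; they are not Snowplow's actual classes.

```scala
// Hypothetical types for illustration only; not Snowplow's real event model.
final case class EnrichedEvent(userId: String, page: String)
final case class BadRow(raw: String, reasons: List[String])

object EnrichmentSketch {
  // Toy "enrichment": expects a tab-separated line "userId<TAB>page".
  // A failure keeps the original raw line together with the reasons it was rejected.
  def enrich(raw: String): Either[BadRow, EnrichedEvent] =
    raw.split("\t", -1).toList match {
      case userId :: page :: Nil if userId.nonEmpty && page.nonEmpty =>
        Right(EnrichedEvent(userId, page))
      case _ =>
        Left(BadRow(raw, List("expected exactly 2 non-empty tab-separated fields")))
    }

  // One input stream in, two output streams out: the "stderr" and "stdout" of the stage.
  def process(rawEvents: List[String]): (List[BadRow], List[EnrichedEvent]) =
    rawEvents.map(enrich).partitionMap(identity)
}
```

Because the bad rows carry the untouched raw input, a downstream stage could in principle consume them again after a fix, which is what makes the pattern composable.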
  • 17. This pattern is extremely composable, especially with Kinesis or Kafka streams/topics as the core building block
  • 19. Inside and across our components, we use the Validation applicative functor from the Scalaz project extensively • Scalaz Validation lets us perform a variety of different event validations and enrichments, and then compose (i.e. collate) the failures • This is really powerful! • The Scalaz codebase calls |@| a “DSL for constructing Applicative expressions” – I think of it as “the Scream operator” • Individual components of the enrichment process can themselves collate their own internal failures
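To make "compose (i.e. collate) the failures" concrete without pulling in Scalaz, here is a minimal stand-in built on the standard library's `Either` (Scala 2.13 or later assumed for `toLongOption`). It is not the Scalaz API: in Scalaz the same idea would be expressed with `ValidationNel` and `|@|`. The helper `map2` and the two validators are hypothetical names chosen for this sketch.

```scala
object CollateSketch {
  // Stand-in for a validation type: Left carries ALL failures collected so far.
  type Validated[A] = Either[List[String], A]

  // Combine two independent validations, keeping the failures from both sides.
  def map2[A, B, C](va: Validated[A], vb: Validated[B])(f: (A, B) => C): Validated[C] =
    (va, vb) match {
      case (Right(a), Right(b)) => Right(f(a, b))
      case (Left(e1), Left(e2)) => Left(e1 ++ e2)
      case (Left(e), _)         => Left(e)
      case (_, Left(e))         => Left(e)
    }

  // Deliberately naive check: four dot-separated parts.
  def validateIp(s: String): Validated[String] =
    if (s.split('.').length == 4) Right(s) else Left(List(s"invalid IP address: $s"))

  def validateTimestamp(s: String): Validated[Long] =
    s.toLongOption.toRight(List(s"invalid timestamp: $s"))

  final case class Checked(ip: String, timestamp: Long)

  // Both checks always run; a bad IP does not hide a bad timestamp.
  def check(ip: String, ts: String): Validated[Checked] =
    map2(validateIp(ip), validateTimestamp(ts))((i, t) => Checked(i, t))
}
```

The difference from a plain `for` comprehension over `Either` is that both validators are evaluated and their failures concatenated, rather than stopping at the first `Left`.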
  • 20. There is a great F# article by Scott Wlaschin which describes this approach as “railway-oriented programming” [1] • The Happy Path: if everything succeeds, then this path outputs an enriched event; any individual failure along the path could switch us onto the failure path; we never get back onto the happy path once we leave it • The Failure Path: any failure can take us onto the failure path; we can choose whether to switch straight to the failure path (“fail fast”), or collate failures from multiple independent tests [1] http://fsharpforfunandprofit.com/posts/recipe-part2/
  • 21. Putting it all together, the Snowplow enrichment process boils down to one big type transformation • Types abstracting over simpler types • No mutable state • Railway-oriented programming • Collate failures inside a processing stage, fail fast between processing stages
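The rule "collate failures inside a processing stage, fail fast between processing stages" can be illustrated with a small standard-library sketch (Scala 2.13 or later assumed). The stages `parse` and `total` and the comma-separated input are hypothetical; the point is only the shape: independent checks inside a stage are collected, while the `for` comprehension between stages stops at the first failing stage.

```scala
object StagesSketch {
  type Stage[A] = Either[List[String], A]

  // Inside a stage: run independent checks and collate every failure.
  def collate[A](checks: List[Stage[A]]): Stage[List[A]] = {
    val (errs, oks) = checks.partitionMap(identity)
    if (errs.isEmpty) Right(oks) else Left(errs.flatten)
  }

  // Stage 1: parse each comma-separated field; all bad fields are reported together.
  def parse(raw: String): Stage[List[Int]] =
    collate(raw.split(",").toList.map { field =>
      field.trim.toIntOption.toRight(List(s"not an integer: '${field.trim}'"))
    })

  // Stage 2: only meaningful once stage 1 has succeeded.
  def total(values: List[Int]): Stage[Int] =
    if (values.sum >= 0) Right(values.sum) else Left(List("negative total"))

  // Between stages: fail fast. If parse fails, total is never evaluated.
  def pipeline(raw: String): Stage[Int] =
    for {
      values <- parse(raw)
      sum    <- total(values)
    } yield sum
}
```

Failing fast between stages makes sense here because the second stage depends on the output of the first, so there is nothing independent left to check once parsing has failed.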
  • 22. The only limitation is that the Failure Path restricts us to a single type • Using Scott Wlaschin’s “fruit as cargo” metaphor: • Currently Snowplow uses a Non-Empty List of Strings to collect our failures: • We are working on a ProcessingMessage case class, to capture much richer and more structured failures than we can using Strings
  • 23. A brief aside on testing
  • 24. On the testing side: we love Specs2 data tables… • They let us test a variety of inputs and expected outputs without making the mistake of just duplicating the data processing functionality in the test:
  • 25. … and are starting to do more with ScalaCheck • ScalaCheck is a property-based testing framework, originally inspired by Haskell’s QuickCheck • We use it in a few places – including to generate unpredictable bad data and also to validate our new Thrift schema for raw Snowplow events:
  • 26. Robustness in the face of user-defined types
  • 27. Snowplow is evolving from a fixed-schema platform to a platform supporting user-defined JSONs • Where other analytics tools depend on schema-less JSONs or custom variables, we use JSON Schema • Snowplow users send in events as “self-describing JSONs” which have to include the schema URI which validates the event’s JSON body:
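As a rough illustration of a self-describing JSON, the sketch below shows an envelope with a `schema` field and a `data` field, and parses the schema URI into its parts using only the standard library. The URI layout `iglu:<vendor>/<name>/<format>/<version>` is written from memory and should be treated as an assumption; the vendor `com.acme`, the event `ad_click`, its fields, and the `SchemaRef` case class are all hypothetical and are not the Iglu client's API.

```scala
object SelfDescribingSketch {
  // Hypothetical example payload: the envelope names the schema that validates "data".
  val example: String =
    """{
      |  "schema": "iglu:com.acme/ad_click/jsonschema/1-0-0",
      |  "data": { "bannerId": "4acd518f", "clickCount": 1 }
      |}""".stripMargin

  // Hypothetical holder for the parts of a schema URI.
  final case class SchemaRef(vendor: String, name: String, format: String, version: String)

  // Assumed URI layout: iglu:<vendor>/<name>/<format>/<version>
  private val SchemaUri =
    """^iglu:([a-zA-Z0-9_.\-]+)/([a-zA-Z0-9_\-]+)/([a-zA-Z0-9_\-]+)/(\d+-\d+-\d+)$""".r

  // Failures are returned as a list of messages, matching the collate-failures style.
  def parseSchemaUri(uri: String): Either[List[String], SchemaRef] =
    uri match {
      case SchemaUri(vendor, name, format, version) =>
        Right(SchemaRef(vendor, name, format, version))
      case _ =>
        Left(List(s"not a valid schema URI: $uri"))
    }
}
```

Rejecting a malformed URI before any schema lookup keeps the failure on the same typed failure path as every other validation step.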
  • 28. To support JSON Schema, we have open-sourced Iglu, a new schema repository system in Scala/Spray/Swagger/Jackson
  • 29. Our Scala client library for Iglu lets us work with JSONs in a safe way from within Snowplow • If a JSON passes its JSON Schema validation, we should be able to deserialize it and work with it safely in Scala in a strongly-typed way: • We use json4s with the Jackson bindings, as JSON Schema support in Java/Scala is Jackson-based • We still wrap our JSON deserialization in Scalaz Validations in case of any mismatch between the Scala deserialization code and the JSON schema
  • 30. Questions? http://snowplowanalytics.com https://github.com/snowplow/snowplow @snowplowdata To meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com Discount code: ulogprugcf (43% off Unified Log Processing eBook)