Snowplow Analytics – From NoSQL to SQL and back
London NoSQL, 17th November 2014
Introducing myself 
• Alex Dean 
• Co-founder and technical lead at Snowplow, 
the open-source event analytics platform 
based here in London [1] 
• Weekend writer of Unified Log Processing, 
available on the Manning Early Access Program 
[2] 
[1] https://github.com/snowplow/snowplow 
[2] http://manning.com/dean
So what’s Snowplow?
Snowplow is an event analytics platform 
[Diagram: Snowplow collects event data, then either warehouses it in a data warehouse or publishes it to a unified log]
• Warehouse event data: perform the high value analyses that drive the bottom line
• Publish event data to a unified log: act on your data in real-time
Snowplow was created as a response to the limitations of 
traditional web analytics programs: 
Data collection:
• Sample-based (e.g. Google Analytics)
• Limited set of events, e.g. page views, goals, transactions
• Limited set of ways of describing events (custom dim 1, custom dim 2…)
Data processing:
• Data is processed ‘once’
• No validation
• No opportunity to reprocess, e.g. following an update to business rules
• Data is aggregated prematurely
Data access:
• Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
• Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)
• Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
• As a result, data is siloed: hard to join with other data sets
We took a fresh approach to digital analytics 
Other vendors tell you 
what to do with your data 
We give you your data so you can do 
whatever you want
How do users leverage their Snowplow event warehouse? 
The event warehouse enables…
• Agile aka ad hoc analytics
• Marketing attribution modelling
• Customer lifetime value calculations
• Customer churn detection
• RTB fraud detection
• Product rec’s
Early on, we made a crucial decision: Snowplow should be 
composed of a set of loosely coupled subsystems 
1. Trackers: generate event data from any environment
2. Collectors: log raw events from trackers
3. Enrich: validate and enrich raw events
4. Storage: store enriched events ready for analysis
5. Analytics: analyze enriched events
A, B, C, D = standardised data protocols linking each subsystem to the next
These turned out to be critical to allowing us to evolve the above stack
Our data storage journey: 
starting with NoSQL
Our initial skunkworks version of Snowplow used Amazon S3 to 
store events, and then Hive to query them 
[Diagram: Snowplow data pipeline v1, showing a website / webapp, the JavaScript event tracker, the CloudFront-based pixel collector, the HiveQL + Java UDF “ETL” and Amazon S3]
• Batch-based
• Normally run overnight; sometimes every 4-6 hours
We used a sparsely populated, de-normalized “fat table” 
approach for our events stored in Amazon S3
This got us started, but “time to report” was frustratingly slow 
for business analysts 
Typical questions asked of the events in Amazon S3:
• How many unique visitors did we have in October?
• What’s our average order value this year?
• What royalty payments should we invoice for this month?
Steps to answer each one:
• Spin up transient EMR cluster
• Log in to master node via SSH
• Write HiveQL query (or adapt from our cookbook of recipes)
• Hive kicks off MapReduce job
• MapReduce job reads events stored in S3 (slower than direct HDFS access)
• Result is printed out in SSH terminal
From NoSQL to high-performance 
SQL
So we extended Snowplow to support columnar databases – after 
a first fling with Infobright, we integrated Amazon Redshift* 
[Diagram: Snowplow pipeline with columnar storage, showing a website, server, application or mobile app, the Snowplow event tracking SDK, the HTTP-based event collector, Amazon S3, Hadoop-based enrichment, Amazon Redshift and Infobright]
* For small users we also added PostgreSQL support, because Redshift and 
PostgreSQL have extremely similar APIs
Our existing sparsely populated, de-normalized “fat tables” 
turned out to be a great fit for columnar storage 
• In columnar databases, compression is done on individual 
columns across many different rows, so the wide rows don’t 
have a negative impact on storage/compression 
• Having all the potential events de-normalized in a single fat row 
meant we didn’t need to worry about JOIN performance in 
Redshift 
• The main downside was the brittleness of the events table: 
1. We found ourselves regularly ALTERing the table to add 
new event types 
2. Snowplow users and customers ended up with 
customized versions of the event table to meet their own 
requirements
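
To illustrate the compression point above, here is a minimal sketch in Python. It is not Redshift's actual column encoding: it simply run-length encodes one sparsely populated column, which is one common way columnar stores keep mostly-empty columns cheap. The column name and sample values are made up for illustration.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive equal values into [value, run_length] pairs."""
    return [[value, len(list(run))] for value, run in groupby(column)]


# Hypothetical sparse column from a "fat table": only 2 of 1,000 events
# are transactions, so the order id is empty (None) almost everywhere.
tr_orderid = [None] * 400 + ["order-1"] + [None] * 598 + ["order-2"]

encoded = run_length_encode(tr_orderid)
print(len(tr_orderid), "values stored as", len(encoded), "runs")
```

Because each column is encoded on its own, adding more mostly-empty columns to the fat row adds only a handful of runs each.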
We experimented with Redshift JOINs and found they 
could be performant 
• As long as two tables in Redshift have the same DISTKEY (for 
sharding data around the cluster) and SORTKEY (for sorting the 
rows on disk), Redshift JOINs can be performant 
• Yes, even mega-to-huge joins! 
• This led us to a new relational architecture: 
• A parent table, atomic.events, containing our old legacy 
“full-fat” definition 
• Child tables containing individual JSONs representing new 
event types or bundles of context describing the event
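
Why a shared DISTKEY and SORTKEY helps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Redshift is implemented: rows are hashed to a made-up number of slices on the join key, sorted within each slice, and then joined slice by slice with a single-pass merge join, so no row has to move between slices. All table contents and names are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical cluster size


def distribute(rows, key):
    """Place each row on a slice by hashing `key`, then sort each slice on `key`."""
    slices = defaultdict(list)
    for row in rows:
        slice_id = zlib.crc32(str(row[key]).encode("utf-8")) % NUM_SLICES
        slices[slice_id].append(row)
    for part in slices.values():
        part.sort(key=lambda r: r[key])
    return slices


def merge_join(left, right, key):
    """Join two lists that are already sorted on `key` in one forward pass."""
    joined, j = [], 0
    for l_row in left:
        # Skip child rows whose key is smaller than the current parent key.
        while j < len(right) and right[j][key] < l_row[key]:
            j += 1
        k = j
        while k < len(right) and right[k][key] == l_row[key]:
            joined.append({**l_row, **right[k]})
            k += 1
    return joined


def colocated_join(parent, child, key):
    """Join slice by slice; matching keys always live on the same slice."""
    p, c = distribute(parent, key), distribute(child, key)
    result = []
    for slice_id in range(NUM_SLICES):
        result.extend(merge_join(p[slice_id], c[slice_id], key))
    return result


events = [{"event_id": f"e{i}", "event": "page_view"} for i in range(6)]
contexts = [
    {"event_id": "e1", "browser": "Firefox"},
    {"event_id": "e4", "browser": "Chrome"},
]
for row in colocated_join(events, contexts, "event_id"):
    print(row)
```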
Our new relational approach for Redshift 
• A typical Snowplow deployment in Redshift now looks like this: 
• In fact, the first thing a Snowplow analyst often does is “re-build” 
in a SQL view a company-specific “full-fat” table by 
JOINing in all their child tables
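
A minimal sketch of such a "re-built" view, using Python's built-in sqlite3 module as a stand-in for Redshift. The child table names, the root_id join column and all column names are assumptions for illustration, and SQLite has no DISTKEY or SORTKEY, so this shows only the shape of the view, not its performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Parent table (stand-in for atomic.events) plus two hypothetical child tables.
    CREATE TABLE events (event_id TEXT PRIMARY KEY, event TEXT);
    CREATE TABLE com_acme_link_click_1 (root_id TEXT, target_url TEXT);
    CREATE TABLE com_acme_user_context_1 (root_id TEXT, plan TEXT);

    INSERT INTO events VALUES ('e1', 'page_view'), ('e2', 'link_click');
    INSERT INTO com_acme_link_click_1 VALUES ('e2', 'http://example.com');
    INSERT INTO com_acme_user_context_1 VALUES ('e1', 'free'), ('e2', 'pro');

    -- The company-specific "full-fat" table, re-built as a view.
    CREATE VIEW full_fat_events AS
    SELECT e.event_id, e.event, lc.target_url, uc.plan
    FROM events e
    LEFT JOIN com_acme_link_click_1 lc ON lc.root_id = e.event_id
    LEFT JOIN com_acme_user_context_1 uc ON uc.root_id = e.event_id;
""")

rows = conn.execute(
    "SELECT event_id, event, target_url, plan FROM full_fat_events ORDER BY event_id"
).fetchall()
for row in rows:
    print(row)
```

LEFT JOINs keep every parent event in the view even when a given child table has no row for it.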
We built a custom process to perform safe shredding of 
JSONs into dedicated Redshift tables
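
The idea of shredding can be sketched as follows. This is a simplified illustration in Python, not our actual shredding process: it assumes each JSON carries a schema URI of the form iglu:vendor/name/format/version and a data payload, derives a table name from the URI, and tags every row with the parent event's id. The field names (contexts, root_id) and the sample event are assumptions, and no validation is performed here.

```python
import re
from collections import defaultdict


def table_name(schema_uri):
    """Derive a table name such as com_acme_link_click_1 from a schema URI."""
    vendor, name, _format, version = schema_uri[len("iglu:"):].split("/")
    model = version.split("-")[0]  # keep only the major version
    return re.sub(r"[^a-z0-9]+", "_", f"{vendor}_{name}_{model}".lower())


def shred(event):
    """Split an event's JSONs into rows for their dedicated tables."""
    tables = defaultdict(list)
    for entity in event["contexts"]:
        row = {"root_id": event["event_id"], **entity["data"]}
        tables[table_name(entity["schema"])].append(row)
    return dict(tables)


event = {
    "event_id": "e2",
    "contexts": [
        {"schema": "iglu:com.acme/link_click/jsonschema/1-0-0",
         "data": {"target_url": "http://example.com"}},
        {"schema": "iglu:com.acme/user_context/jsonschema/1-0-0",
         "data": {"plan": "pro"}},
    ],
}
shredded = shred(event)
print(shredded)
```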
This is working well – but there is a lot of room for 
improvement 
• Our shredding process is closely tied to Redshift’s innovative 
COPY FROM JSON functionality: 
• This is Redshift-specific – so we can’t extend our shredding 
process to other columnar databases e.g. Vertica, Netezza 
• The syntax doesn’t support nested shredding – which 
would allow us to e.g. intelligently shred an order into line 
items, products, customer etc 
• We have to maintain copies of the JSON Paths files required 
by COPY FROM JSON in all AWS regions 
• So, we plan to port the Redshift-specific aspects of our 
shredding process out of COPY FROM JSON into Snowplow and 
Iglu
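
What nested shredding could look like, as a hedged Python sketch rather than a design we have committed to: nested objects and arrays of objects are split into their own tables, and each child row points back at its parent row. The table naming, the id scheme and the sample order are all hypothetical.

```python
from collections import defaultdict


def shred_nested(name, record, parent_id, tables):
    """Recursively split `record` into flat rows, one table per nesting level."""
    row = {"parent_id": parent_id, "id": f"{name}:{len(tables[name])}"}
    tables[name].append(row)  # reserve the row (and its id) before recursing
    for field, value in record.items():
        if isinstance(value, dict):
            shred_nested(field, value, row["id"], tables)
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            for item in value:
                shred_nested(field, item, row["id"], tables)
        else:
            row[field] = value  # scalar (or non-object array): stays on this row


order = {
    "order_id": "o-1",
    "customer": {"email": "a@example.com"},
    "line_items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
}
tables = defaultdict(list)
shred_nested("order", order, None, tables)
for table, rows in tables.items():
    print(table, rows)
```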
Our data storage journey: to 
a mixed SQL / NoSQL model
Snowplow is re-architecting around the unified log 
[Diagram: narrow data silos (search, e-comm, CRM, email marketing, ERP, CMS) in the cloud vendor / own data center and at SaaS vendors #1 and #2, some with low latency local loops, feed the unified log via streaming APIs / web hooks. The unified log is archived to Hadoop. Consumers shown: systems monitoring, eventstream APIs, product rec’s, ad hoc analytics, management reporting, fraud detection, churn prevention]
• Unified log: low latency, wide data coverage, few days’ data history
• Hadoop archive: high latency, wide data coverage, full data history
The unified log is Amazon Kinesis, or Apache Kafka 
[Diagram: the same unified log architecture as on the previous slide]
• Amazon Kinesis, a 
hosted AWS service 
• Extremely similar 
semantics to Kafka 
• Apache Kafka, an append-only, 
distributed, ordered 
commit log 
• Developed at LinkedIn to 
serve as their 
organization’s unified log
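
The core abstraction can be sketched in a few lines of Python. This is an in-memory toy with a single partition, not the Kafka or Kinesis API: records are only ever appended, each gets an increasing offset, and every consumer tracks its own position, so several systems can read the same stream independently. Class and method names are made up for illustration.

```python
class UnifiedLog:
    """Append-only, ordered list of records addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to `max_records` records starting at `offset`."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads the log at its own pace by remembering its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = UnifiedLog()
for name in ["page_view", "add_to_basket", "transaction"]:
    log.append({"event": name})

fraud_detection = Consumer(log)
archiver = Consumer(log)
print(fraud_detection.poll(max_records=2))  # first two events
print(archiver.poll())                      # all three, unaffected by the other consumer
```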
“Kafka is designed to allow a 
single cluster to serve as the 
central data backbone for a 
large organization” [1] 
[1] http://kafka.apache.org/
“if you squint a bit, you can see the 
whole of your organization's systems and 
data flows as a single distributed 
database. You can view all the individual 
query-oriented systems (Redis, SOLR, 
Hive tables, and so on) as just particular 
indexes on your data.” [1] 
[1] http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
In a unified log world, Snowplow will be feeding a mix of 
different SQL, NoSQL and stream databases 
[Diagram: Snowplow Trackers send events to the Scala Stream Collector, which writes a raw event stream. The Enrich Kinesis app reads it and writes an enriched event stream plus a bad raw events stream. Kinesis apps reading the enriched event stream: S3 sink, Redshift sink, Elasticsearch sink and event aggregator. Stores shown: S3, Redshift, Elasticsearch and DynamoDB. Some components are marked as not yet released]
• Analytics on Read (for agile exploration of event stream, ML, auditing, applying alternate models, reprocessing etc)
• Analytics on Write (for dashboarding, audience segmentation, RTB, etc)
We have already experimented with Neo4J for customer 
flow/path analysis [1] 
[1] http://snowplowanalytics.com/blog/2014/07/31/using-graph-databases-to-perform-pathing-analysis-initial-experimentation-with-neo4j/
During our current work integrating Elasticsearch we discovered 
that common “NoSQL” databases need schemas too 
• A simple example of schemas in Elasticsearch: 
$ curl -XPUT 'http://localhost:9200/blog/contra/4' -d '{"t": ["u", 999]}' 
{"_index":"blog","_type":"contra","_id":"4","_version":1,"created":true} 
$ curl -XPUT 'http://localhost:9200/blog/contra/4' -d '{"p": [11, "q"]}' 
{"error":"MapperParsingException[failed to parse [p]]; nested: NumberFormatException[For input string: "q"]; ","status":400} 
• Elasticsearch is doing automated “shredding” of incoming JSONs to 
index that data in Lucene
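
The behaviour in the example above can be mimicked with a toy model in Python. This is not Elasticsearch's mapping code, only a sketch that assumes the type of a new field is inferred from the first value seen and that later values must be coercible to it. That assumption is enough to reproduce why ["u", 999] is accepted while [11, "q"] is rejected. All class and type names are illustrative.

```python
class MapperParsingError(ValueError):
    """Raised when a value cannot be coerced to the field's inferred type."""


class ToyIndex:
    def __init__(self):
        self.mapping = {}  # field name -> inferred type

    def put(self, doc):
        for field, values in doc.items():
            values = values if isinstance(values, list) else [values]
            if not values:
                continue
            if field not in self.mapping:
                first = values[0]
                is_int = isinstance(first, int) and not isinstance(first, bool)
                self.mapping[field] = "long" if is_int else "string"
            if self.mapping[field] == "long":
                for value in values:
                    try:
                        int(value)
                    except (TypeError, ValueError):
                        raise MapperParsingError(
                            f"failed to parse [{field}]: {value!r} is not a number"
                        )
            # A "string" field accepts anything: numbers are stored as text.


index = ToyIndex()
index.put({"t": ["u", 999]})       # inferred as string; 999 is coerced
try:
    index.put({"p": [11, "q"]})    # inferred as long; "q" cannot be coerced
except MapperParsingError as err:
    print(err)
print(index.mapping)
```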
We are now working on our second shredder  
• Our Elasticsearch loader contains code to shred our events’ 
heterogeneous JSON arrays and dictionaries into a format that is 
compatible with Elasticsearch 
• This is conceptually a much simpler shredder than the one we had 
to build for Redshift 
• When we add Google BigQuery support, we will need to write yet 
another shredder to handle the specifics of that data store 
• Hopefully we can unify and generalize our shredding technology 
so it works across columnar, relational, document and graph 
databases – a big undertaking but super powerful!
Questions? 
Discount code: ulogprugcf (43% off 
Unified Log Processing eBook) 
http://snowplowanalytics.com 
https://github.com/snowplow/snowplow 
@snowplowdata 
To meet up or chat, @alexcrdean on Twitter or 
alex@snowplowanalytics.com

Mais conteúdo relacionado

Mais procurados

Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWKent Graziano
 
Microsoft Azure Training | Azure Training For Beginners | Azure Tutorial For ...
Microsoft Azure Training | Azure Training For Beginners | Azure Tutorial For ...Microsoft Azure Training | Azure Training For Beginners | Azure Tutorial For ...
Microsoft Azure Training | Azure Training For Beginners | Azure Tutorial For ...Simplilearn
 
Micron: Memory Expansion with CXL Modules: Benefits, Use Cases and Enriching ...
Micron: Memory Expansion with CXL Modules: Benefits, Use Cases and Enriching ...Micron: Memory Expansion with CXL Modules: Benefits, Use Cases and Enriching ...
Micron: Memory Expansion with CXL Modules: Benefits, Use Cases and Enriching ...Memory Fabric Forum
 
Adopting Multi-Cloud Services with Confidence
Adopting Multi-Cloud Services with ConfidenceAdopting Multi-Cloud Services with Confidence
Adopting Multi-Cloud Services with ConfidenceKevin Hakanson
 
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowSimplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowPyData
 
Ceph Introduction 2017
Ceph Introduction 2017  Ceph Introduction 2017
Ceph Introduction 2017 Karan Singh
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseSnowflake Computing
 
Comparing Accumulo, Cassandra, and HBase
Comparing Accumulo, Cassandra, and HBaseComparing Accumulo, Cassandra, and HBase
Comparing Accumulo, Cassandra, and HBaseAccumulo Summit
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to CephCeph Community
 
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)Faysal Shaarani (MBA)
 
On the Road to DSpace 7: Angular UI + REST
On the Road to DSpace 7: Angular UI + RESTOn the Road to DSpace 7: Angular UI + REST
On the Road to DSpace 7: Angular UI + RESTTim Donohue
 
Solutions Architect's Handbook 2nd Edition - Book Review
Solutions Architect's Handbook 2nd Edition - Book ReviewSolutions Architect's Handbook 2nd Edition - Book Review
Solutions Architect's Handbook 2nd Edition - Book ReviewAshraf Fouad
 
What you need to know about ceph
What you need to know about cephWhat you need to know about ceph
What you need to know about cephEmma Haruka Iwao
 
Mainframe Modernization with Precisely and Microsoft Azure
Mainframe Modernization with Precisely and Microsoft AzureMainframe Modernization with Precisely and Microsoft Azure
Mainframe Modernization with Precisely and Microsoft AzurePrecisely
 
Sap on azure airlift architecture (2)
Sap on azure airlift architecture (2)Sap on azure airlift architecture (2)
Sap on azure airlift architecture (2)Rahim Abdul Kader
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataWorks Summit
 
A complete guide to azure storage
A complete guide to azure storageA complete guide to azure storage
A complete guide to azure storageHimanshu Sahu
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategyJames Serra
 

Mais procurados (20)

Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
 
Nifi
NifiNifi
Nifi
 
Microsoft Azure Training | Azure Training For Beginners | Azure Tutorial For ...
Microsoft Azure Training | Azure Training For Beginners | Azure Tutorial For ...Microsoft Azure Training | Azure Training For Beginners | Azure Tutorial For ...
Microsoft Azure Training | Azure Training For Beginners | Azure Tutorial For ...
 
Snowflake Architecture
Snowflake ArchitectureSnowflake Architecture
Snowflake Architecture
 
Micron: Memory Expansion with CXL Modules: Benefits, Use Cases and Enriching ...
Micron: Memory Expansion with CXL Modules: Benefits, Use Cases and Enriching ...Micron: Memory Expansion with CXL Modules: Benefits, Use Cases and Enriching ...
Micron: Memory Expansion with CXL Modules: Benefits, Use Cases and Enriching ...
 
Adopting Multi-Cloud Services with Confidence
Adopting Multi-Cloud Services with ConfidenceAdopting Multi-Cloud Services with Confidence
Adopting Multi-Cloud Services with Confidence
 
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowSimplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
 
Ceph Introduction 2017
Ceph Introduction 2017  Ceph Introduction 2017
Ceph Introduction 2017
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 
Comparing Accumulo, Cassandra, and HBase
Comparing Accumulo, Cassandra, and HBaseComparing Accumulo, Cassandra, and HBase
Comparing Accumulo, Cassandra, and HBase
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
 
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
 
On the Road to DSpace 7: Angular UI + REST
On the Road to DSpace 7: Angular UI + RESTOn the Road to DSpace 7: Angular UI + REST
On the Road to DSpace 7: Angular UI + REST
 
Solutions Architect's Handbook 2nd Edition - Book Review
Solutions Architect's Handbook 2nd Edition - Book ReviewSolutions Architect's Handbook 2nd Edition - Book Review
Solutions Architect's Handbook 2nd Edition - Book Review
 
What you need to know about ceph
What you need to know about cephWhat you need to know about ceph
What you need to know about ceph
 
Mainframe Modernization with Precisely and Microsoft Azure
Mainframe Modernization with Precisely and Microsoft AzureMainframe Modernization with Precisely and Microsoft Azure
Mainframe Modernization with Precisely and Microsoft Azure
 
Sap on azure airlift architecture (2)
Sap on azure airlift architecture (2)Sap on azure airlift architecture (2)
Sap on azure airlift architecture (2)
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
 
A complete guide to azure storage
A complete guide to azure storageA complete guide to azure storage
A complete guide to azure storage
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
 

Destaque

Modelling event data in look ml
Modelling event data in look mlModelling event data in look ml
Modelling event data in look mlyalisassoon
 
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015yalisassoon
 
Snowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comSnowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comyalisassoon
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...yalisassoon
 
How we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changingHow we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changingyalisassoon
 
Snowplow is at the core of everything we do
Snowplow is at the core of everything we doSnowplow is at the core of everything we do
Snowplow is at the core of everything we doyalisassoon
 
Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016yalisassoon
 
Snowplow at Sigfig
Snowplow at SigfigSnowplow at Sigfig
Snowplow at Sigfigyalisassoon
 
Understanding event data
Understanding event dataUnderstanding event data
Understanding event datayalisassoon
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Lucidworks
 
Using Snowplow for A/B testing and user journey analysis at CustomMade
Using Snowplow for A/B testing and user journey analysis at CustomMadeUsing Snowplow for A/B testing and user journey analysis at CustomMade
Using Snowplow for A/B testing and user journey analysis at CustomMadeyalisassoon
 
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016yalisassoon
 

Destaque (12)

Modelling event data in look ml
Modelling event data in look mlModelling event data in look ml
Modelling event data in look ml
 
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
Chefsfeed presentation to Snowplow Meetup San Francisco, Oct 2015
 
Snowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.comSnowplow Analytics and Looker at Oyster.com
Snowplow Analytics and Looker at Oyster.com
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...
 
How we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changingHow we use Hive at SnowPlow, and how the role of HIve is changing
How we use Hive at SnowPlow, and how the role of HIve is changing
 
Snowplow is at the core of everything we do
Snowplow is at the core of everything we doSnowplow is at the core of everything we do
Snowplow is at the core of everything we do
 
Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016Snowplow: where we came from and where we are going - March 2016
Snowplow: where we came from and where we are going - March 2016
 
Snowplow at Sigfig
Snowplow at SigfigSnowplow at Sigfig
Snowplow at Sigfig
 
Understanding event data
Understanding event dataUnderstanding event data
Understanding event data
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
 
Using Snowplow for A/B testing and user journey analysis at CustomMade
Using Snowplow for A/B testing and user journey analysis at CustomMadeUsing Snowplow for A/B testing and user journey analysis at CustomMade
Using Snowplow for A/B testing and user journey analysis at CustomMade
 
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
Analytics at Carbonite: presentation to Snowplow Meetup Boston April 2016
 

Semelhante a Snowplow Analytics: from NoSQL to SQL and back again

Big Data Beers - Introducing Snowplow
Big Data Beers - Introducing SnowplowBig Data Beers - Introducing Snowplow
Big Data Beers - Introducing SnowplowAlexander Dean
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logAlexander Dean
 
AWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAlexander Dean
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaAlexander Dean
 
Asynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAsynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAlexander Dean
 
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Amazon Web Services
 
DBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data ApplicationsDBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data Applicationsdecode2016
 
TenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience SharingTenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience SharingChen-en Lu
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) Surendar S
 
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...confluent
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...Amazon Web Services
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsANKIT GUPTA
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Hunk - Unlocking The Power of Big Data Breakout Session
Hunk - Unlocking The Power of Big Data Breakout SessionHunk - Unlocking The Power of Big Data Breakout Session
Hunk - Unlocking The Power of Big Data Breakout SessionSplunk
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
 
A Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices GenerationA Global Source of Truth for the Microservices Generation
A Global Source of Truth for the Microservices GenerationBen Stopford
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 

Semelhante a Snowplow Analytics: from NoSQL to SQL and back again (20)

Big Data Beers - Introducing Snowplow
Big Data Beers - Introducing SnowplowBig Data Beers - Introducing Snowplow
Big Data Beers - Introducing Snowplow
 
Span Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified logSpan Conference: Why your company needs a unified log
Span Conference: Why your company needs a unified log
 
AWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified logAWS User Group UK: Why your company needs a unified log
AWS User Group UK: Why your company needs a unified log
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
Asynchronous micro-services and the unified log
Asynchronous micro-services and the unified logAsynchronous micro-services and the unified log
Asynchronous micro-services and the unified log
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
 
DBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data ApplicationsDBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data Applications
 
Snowplow Analytics: from NoSQL to SQL and back again

  • 1. Snowplow Analytics – From NoSQL to SQL and back. London NoSQL, 17th November 2014
  • 2. Introducing myself • Alex Dean • Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1] • Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2] [1] https://github.com/snowplow/snowplow [2] http://manning.com/dean
  • 4. Snowplow is an event analytics platform: collect event data; warehouse event data in a data warehouse; publish event data to a unified log; perform the high value analyses that drive the bottom line; act on your data in real-time
  • 5. Snowplow was created as a response to the limitations of traditional web analytics programs:
      Data collection
      • Sample-based (e.g. Google Analytics)
      • Limited set of events e.g. page views, goals, transactions
      • Limited set of ways of describing events (custom dim 1, custom dim 2…)
      Data processing
      • Data is processed ‘once’
      • No validation
      • No opportunity to reprocess e.g. following update to business rules
      • Data is aggregated prematurely
      Data access
      • Only particular combinations of metrics / dimensions can be pivoted together (Google Analytics)
      • Only particular types of analysis are possible on different types of dimension (e.g. sProps, eVars, conversion goals in SiteCatalyst)
      • Data is either aggregated (e.g. Google Analytics), or available as a complete log file for a fee (e.g. Adobe SiteCatalyst)
      • As a result, data is siloed: hard to join with other data sets
  • 6. We took a fresh approach to digital analytics. Other vendors tell you what to do with your data. We give you your data so you can do whatever you want.
  • 7. How do users leverage their Snowplow event warehouse? Agile aka ad hoc analytics Enables… Marketing attribution modelling Customer lifetime value calculations Customer churn detection RTB fraud detection Product rec’s Event warehouse
  • 8. Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems, linked by standardised data protocols (labelled A to D between consecutive subsystems):
      1. Trackers: generate event data from any environment
      2. Collectors: log raw events from trackers
      3. Enrich: validate and enrich raw events
      4. Storage: store enriched events ready for analysis
      5. Analytics: analyze enriched events
      These standardised data protocols turned out to be critical to allowing us to evolve the above stack
  • 9. Our data storage journey: starting with NoSQL
  • 10. Our initial skunkworks version of Snowplow used Amazon S3 to store events, and then Hive to query them. Snowplow data pipeline v1 components: website / webapp with JavaScript event tracker, CloudFront-based pixel collector, HiveQL + Java UDF “ETL”, Amazon S3 • Batch-based • Normally run overnight; sometimes every 4-6 hours
  • 11. We used a sparsely populated, de-normalized “fat table” approach for our events stored in Amazon S3
  • 12. This got us started, but “time to report” was frustratingly slow for business analysts
      Typical questions asked of the events in Amazon S3:
      • How many unique visitors did we have in October?
      • What’s our average order value this year?
      • What royalty payments should we invoice for this month?
      Steps needed to answer each one:
      • Spin up transient EMR cluster
      • Log in to master node via SSH
      • Write HiveQL query (or adapt from our cookbook of recipes)
      • Hive kicks off MapReduce job
      • MapReduce job reads events stored in S3 (slower than direct HDFS access)
      • Result is printed out in SSH terminal
  • 13. From NoSQL to high-performance SQL
  • 14. So we extended Snowplow to support columnar databases – after a first fling with Infobright, we integrated Amazon Redshift*. Pipeline components: Snowplow event tracking SDK (website, server, application or mobile app), HTTP-based event collector, Hadoop-based enrichment, Amazon S3, Amazon Redshift, Infobright. * For small users we also added PostgreSQL support, because Redshift and PostgreSQL have extremely similar APIs
  • 15. Our existing sparsely populated, de-normalized “fat tables” turned out to be a great fit for columnar storage • In columnar databases, compression is done on individual columns across many different rows, so the wide rows don’t have a negative impact on storage/compression • Having all the potential events de-normalized in a single fat row meant we didn’t need to worry about JOIN performance in Redshift • The main downside was the brittleness of the events table: 1. We found ourselves regularly ALTERing the table to add new event types 2. Snowplow users and customers ended up with customized versions of the event table to meet their own requirements
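To see why sparse, wide rows cost little in a columnar layout, here is a minimal Python sketch. The event fields are made up for illustration, and run-length encoding stands in for whatever column encodings a real database would choose; it is not a description of Redshift's internals.

```python
from itertools import groupby

# Hypothetical sparse "fat" events: most fields are empty for any one event type.
events = [
    {"event": "page_view", "page_url": "/home", "tr_total": None},
    {"event": "page_view", "page_url": "/home", "tr_total": None},
    {"event": "page_view", "page_url": "/shop", "tr_total": None},
    {"event": "transaction", "page_url": None, "tr_total": 42.0},
]

def to_columns(rows):
    """Pivot row-oriented events into one list of values per column."""
    names = rows[0].keys()
    return {name: [row[name] for row in rows] for name in names}

def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(events)
encoded = {name: run_length_encode(vals) for name, vals in columns.items()}
```

Each column is encoded on its own, so a column that is almost entirely empty collapses to a handful of runs however many rows the table holds.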
  • 16. We experimented with Redshift JOINs and found they could be performant • As long as two tables in Redshift have the same DISTKEY (for sharding data around the cluster) and SORTKEY (for sorting the row on disk), Redshift JOINs can be performant • Yes, even mega-to-huge joins! • This led us to a new relational architecture: • A parent table, atomic.events, containing our old legacy “full-fat” definition • Child tables containing individual JSONs representing new event types or bundles of context describing the event
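The intuition behind the DISTKEY and SORTKEY point can be sketched in Python. This is a toy model under stated assumptions: the node count, the hash function and the `event_id` join key are all hypothetical, and a real cluster does far more. The point it illustrates is that when both tables are distributed on the same key, each node can join its own slice without moving rows, and sorted slices allow a single-pass merge join.

```python
import zlib
from collections import defaultdict

NODES = 3  # hypothetical cluster size

def node_for(key):
    """Pick a node from the distribution key (stand-in for DISTKEY hashing)."""
    return zlib.crc32(key.encode("utf-8")) % NODES

def distribute(rows, key):
    """Group rows by the node that owns their distribution key."""
    slices = defaultdict(list)
    for row in rows:
        slices[node_for(row[key])].append(row)
    return slices

def merge_join(left, right, key):
    """Single-pass join of two lists sorted on key; left keys are assumed unique."""
    out, i = [], 0
    for r in right:
        while i < len(left) and left[i][key] < r[key]:
            i += 1
        if i < len(left) and left[i][key] == r[key]:
            out.append({**left[i], **r})
    return out

def colocated_join(parents, children, key):
    """Join parent and child rows node by node, never across nodes."""
    parent_slices = distribute(parents, key)
    child_slices = distribute(children, key)
    joined = []
    for node in range(NODES):
        # Sorting stands in for the SORTKEY: rows are already ordered on disk.
        local_parents = sorted(parent_slices[node], key=lambda r: r[key])
        local_children = sorted(child_slices[node], key=lambda r: r[key])
        joined.extend(merge_join(local_parents, local_children, key))
    return joined
```

Because matching rows always hash to the same node, no per-node join can miss a match, which is why the result equals a global join.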
  • 17. Our new relational approach for Redshift • A typical Snowplow deployment in Redshift now looks like this: • In fact, the first thing a Snowplow analyst often does is “re-build” in a SQL view a company-specific “full-fat” table by JOINing in all their child tables
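The "re-build a full-fat table in a SQL view" step can be sketched as follows. SQLite (via Python's standard library) stands in for Redshift so the example runs anywhere; the table is called `events` rather than `atomic.events`, and the child tables, their columns and the `root_id` join column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id TEXT PRIMARY KEY, event TEXT);
CREATE TABLE link_click (root_id TEXT, target_url TEXT);
CREATE TABLE ad_impression (root_id TEXT, campaign TEXT);

INSERT INTO events VALUES ('e1', 'link_click'), ('e2', 'ad_impression');
INSERT INTO link_click VALUES ('e1', '/pricing');
INSERT INTO ad_impression VALUES ('e2', 'autumn');

-- One wide row per event again: columns from child tables are NULL
-- whenever the event has no matching child row.
CREATE VIEW events_full_fat AS
SELECT e.event_id, e.event, lc.target_url, ai.campaign
FROM events AS e
LEFT JOIN link_click AS lc ON lc.root_id = e.event_id
LEFT JOIN ad_impression AS ai ON ai.root_id = e.event_id;
""")

rows = conn.execute(
    "SELECT event_id, event, target_url, campaign "
    "FROM events_full_fat ORDER BY event_id"
).fetchall()
```

LEFT JOINs keep every parent event in the view even when it has no child rows, which is what makes the view behave like the old sparse fat table.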
  • 18. We built a custom process to perform safe shredding of JSONs into dedicated Redshift tables
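A minimal Python sketch of what shredding could look like, not the actual Snowplow process. It assumes each event carries a `contexts` list of JSONs shaped as `{"schema": ..., "data": ...}`, that the schema string follows an `iglu:vendor/name/format/version` layout, and that table names are derived from it; all of these names and conventions are assumptions for illustration.

```python
from collections import defaultdict

def table_name(schema_uri):
    """Map 'iglu:com.acme/link_click/jsonschema/1-0-0' to 'com_acme_link_click_1'."""
    vendor, name, _fmt, version = schema_uri.split(":", 1)[1].split("/")
    major = version.split("-")[0]
    return f"{vendor}_{name}_{major}".replace(".", "_")

def shred(event):
    """Split one event into a parent row plus rows for dedicated child tables."""
    tables = defaultdict(list)
    rejected = []
    parent = {k: v for k, v in event.items() if k != "contexts"}
    tables["events"].append(parent)
    for item in event.get("contexts", []):
        try:
            # root_id links each child row back to its parent event.
            child = {"root_id": event["event_id"], **item["data"]}
            tables[table_name(item["schema"])].append(child)
        except (KeyError, TypeError, ValueError, IndexError):
            # "Safe" here means a malformed JSON is set aside, not loaded.
            rejected.append(item)
    return dict(tables), rejected
```

Setting malformed JSONs aside instead of raising keeps one bad context from blocking the load of the rest of the event.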
  • 19. This is working well – but there is a lot of room for improvement • Our shredding process is closely tied to Redshift’s innovative COPY FROM JSON functionality: • This is Redshift-specific – so we can’t extend our shredding process to other columnar databases e.g. Vertica, Netezza • The syntax doesn’t support nested shredding – which would allow us to e.g. intelligently shred an order into line items, products, customer etc • We have to maintain copies of the JSON Paths files required by COPY FROM JSON in all AWS regions • So, we plan to port the Redshift-specific aspects of our shredding process out of COPY FROM JSON into Snowplow and Iglu
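The nested shredding mentioned above, splitting an order into line items, customer and so on, might look roughly like this in Python. It is only a sketch of the idea: the table naming scheme and the `parent` reference column are hypothetical, and nothing here reflects how Snowplow or Iglu eventually implemented it.

```python
def shred_nested(table, record, parent_ref=None, out=None):
    """Recursively split nested objects and arrays of objects into child tables."""
    out = {} if out is None else out
    row = {} if parent_ref is None else {"parent": parent_ref}
    children = []
    for field, value in record.items():
        if isinstance(value, dict):
            children.append((f"{table}_{field}", [value]))
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            children.append((f"{table}_{field}", value))
        else:
            row[field] = value  # scalars (and scalar arrays) stay on this row
    out.setdefault(table, []).append(row)
    # A child refers to its parent by (table name, row position).
    ref = (table, len(out[table]) - 1)
    for child_table, items in children:
        for item in items:
            shred_nested(child_table, item, ref, out)
    return out

order = {
    "order_id": "o1",
    "customer": {"name": "Ann"},
    "line_items": [{"sku": "a", "qty": 1}, {"sku": "b", "qty": 2}],
}
shredded = shred_nested("order", order)
```

Here `shredded` holds three tables, `order`, `order_customer` and `order_line_items`, with each child row pointing back at the order row it came from.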
  • 20. Our data storage journey: to a mixed SQL / NoSQL model
  • 21. Snowplow is re-architecting around the unified log [Diagram: narrow data silos (search, e-comm, CRM, email marketing, ERP and CMS, spread across SaaS vendors and a cloud vendor / own data center, with some low latency local loops) publish via streaming APIs / web hooks into a unified log, which is archived to Hadoop. Unified log: low latency, wide data coverage, few days’ data history. Hadoop: high latency, wide data coverage, full data history. Consumers shown: systems monitoring, eventstream, product rec’s, ad hoc analytics, management reporting, fraud detection, churn prevention APIs]
  • 22. The unified log is Amazon Kinesis, or Apache Kafka (shown on the same unified log diagram as slide 21) • Amazon Kinesis, a hosted AWS service • Extremely similar semantics to Kafka • Apache Kafka, an append-only, distributed, ordered commit log • Developed at LinkedIn to serve as their organization’s unified log
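The core semantics of an append-only, ordered log can be sketched in a few lines of Python. This is a single-process, in-memory toy, not the Kafka or Kinesis API: the class and method names are made up, and partitioning, replication and retention are ignored.

```python
class UnifiedLog:
    """Append-only, ordered sequence of records addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its offset; records are never changed."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads the log at its own pace by tracking its own offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because each consumer owns its offset, a warehouse loader and a real-time dashboard can read the same events independently, and a new consumer can replay from offset 0.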
  • 23. “Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1] [1] http://kafka.apache.org/
  • 24. “if you squint a bit, you can see the whole of your organization's systems and data flows as a single distributed database. You can view all the individual query-oriented systems (Redis, SOLR, Hive tables, and so on) as just particular indexes on your data.” [1] [1] http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  • 25. In a unified log world, Snowplow will be feeding a mix of different SQL, NoSQL and stream databases [Diagram: Snowplow Trackers, Scala Stream Collector, raw event stream, Enrich Kinesis app, bad raw events stream, enriched event stream; S3 sink Kinesis app (S3), Redshift sink Kinesis app (Redshift), Elasticsearch sink Kinesis app (Elasticsearch), event aggregator Kinesis app (DynamoDB); some components are marked as not yet released] • Analytics on Read (for agile exploration of event stream, ML, auditing, applying alternate models, reprocessing etc) • Analytics on Write (for dashboarding, audience segmentation, RTB, etc)
  • 26. We have already experimented with Neo4J for customer flow/path analysis [1] [1] http://snowplowanalytics.com/blog/2014/07/31/using-graph-databases-to-perform-pathing-analysis-initial-experimentation-with-neo4j/
  • 27. During our current work integrating Elasticsearch we discovered that common “NoSQL” databases need schemas too
      • A simple example of schemas in Elasticsearch:
          $ curl -XPUT 'http://localhost:9200/blog/contra/4' -d '{"t": ["u", 999]}'
          {"_index":"blog","_type":"contra","_id":"4","_version":1,"created":true}
          $ curl -XPUT 'http://localhost:9200/blog/contra/4' -d '{"p": [11, "q"]}'
          {"error":"MapperParsingException[failed to parse [p]]; nested: NumberFormatException[For input string: "q"]; ","status":400}
      • Elasticsearch is doing automated “shredding” of incoming JSONs to index that data in Lucene
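The behaviour in the Elasticsearch example can be mimicked with a toy model in Python. This is not Elasticsearch's code or API: `ToyIndex`, `infer_type` and `check` are invented names, and the rule "the first value seen fixes the field's type" is a simplification that happens to reproduce both responses shown in the slide.

```python
def infer_type(value):
    """Pick a field type from the first value seen (bool first: bool is an int)."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"

def check(field, value, field_type):
    """Raise ValueError if value cannot be stored under field_type."""
    try:
        if field_type == "long":
            int(value)
        elif field_type == "double":
            float(value)
        elif field_type == "boolean" and not isinstance(value, bool):
            raise ValueError
    except (TypeError, ValueError):
        raise ValueError(f"failed to parse [{field}]: {value!r}") from None

class ToyIndex:
    def __init__(self):
        self.mapping = {}  # field name -> type, fixed once inferred

    def index(self, doc):
        """Accept doc only if every value fits its field's (possibly new) type."""
        proposed = dict(self.mapping)
        for field, value in doc.items():
            values = value if isinstance(value, list) else [value]
            if not values:
                continue
            field_type = proposed.setdefault(field, infer_type(values[0]))
            for item in values:
                check(field, item, field_type)
        self.mapping = proposed  # commit only when the whole document is valid
```

Indexing `{"t": ["u", 999]}` succeeds because anything can be stored as a string, while `{"p": [11, "q"]}` raises because the first element fixes `p` as a number and `"q"` cannot be parsed as one. The schema exists whether or not anyone wrote it down.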
  • 28. We are now working on our second shredder  • Our Elasticsearch loader contains code to shred our events’ heterogeneous JSON arrays and dictionaries into a format that is compatible with Elasticsearch • This is conceptually a much simpler shredder than the one we had to build for Redshift • When we add Google BigQuery support, we will need to write yet another shredder to handle the specifics of that data store • Hopefully we can unify and generalize our shredding technology so it works across columnar, relational, document and graph databases – a big undertaking but super powerful!
  • 29. Questions? Discount code: ulogprugcf (43% off Unified Log Processing eBook) http://snowplowanalytics.com https://github.com/snowplow/snowplow @snowplowdata To meet up or chat, @alexcrdean on Twitter or alex@snowplowanalytics.com