Hands-On: Managing Slowly Changing Dimensions Using TD Workflow

Agenda
● Introduction
● Treasure Data Workflow
● Overview of Slowly Changing Dimensions
● Window Functions
● Handling Type 2 SCDs using Treasure Data

Introduction
• Scott Mitchell
• Senior Solution Engineer
• Work with Enterprise clients to maximize the activation of the client data
• smitchell@treasure-data.com

Introduction
Treasure Data is a Customer Data Platform
“Customer Data Platform (CDP) is a marketer-based management system
that creates a persistent, unified customer database that is accessible to
other systems. Data is pulled from multiple sources, cleaned, and combined
to create a single customer view. This structured data is then made available
to other marketing systems. CDP provides real-time segmentation for
sophisticated personalized marketing.”
https://en.wikipedia.org/wiki/Customer_Data_Platform

Our Customer Data Platform: Foundation
Data Management
● 1st party data (your data): Web, Mobile, Apps, CRMs, Offline
● 2nd & 3rd party DMPs (enrichment)
Tool Integration
● Campaigns
● Advertising
● Social media
● Reporting
● BI & data science
Other diagram labels: ID Unification, Persistent Storage, Workflow Orchestration, Activation, All Your Data, Segmentation, Profiles, Segments, Measurement

DATA ORCHESTRATION AND WORKFLOW MANAGEMENT
• Workflow management across data input, processing, and output
• Supports both scheduled and trigger-based execution
• Cloud-based and client-hosted; the client-hosted version can run custom code
• The cloud-based version has both a web UI and a REST API
The core engine is built on our open source project, Digdag.

Overview
Treasure Workflow allows users to build repeatable data processing pipelines that consist of Treasure Data jobs.

Why use Treasure Workflow?
1. Enhanced Organization
• Organize your processing workflows into groups of similarly-purposed tasks
2. Reduce Errors
• Dependencies no longer have to be managed by scheduled time alone
3. Ease Error Handling
• Split large scripts & queries into smaller, more manageable jobs
4. Improve Collaboration
• Organize your job flows into projects
Benefits

WORKFLOW DEFINITION: CLOSER LOOK
timezone: Asia/Tokyo

schedule:
  daily>: 07:00:00

_export:
  td:
    database: nishi

+load:
  td_load>: import/s3_load.yml
  database: nishi
  table: monthly_goods_sales

+daily:
  td>: queries/daily_open.sql
  create_table: daily_open

+monthly:
  td>: queries/monthly_open.sql
  result_connection: nishi_s3
  result_settings:
    bucket: nishitetsu-test
    path: /monthly_open.csv

• File extension should be “.dig” to be recognized as a workflow
• Standard YAML
• Task names are prefixed by “+”
• Operators are suffixed with “>”
• Schedules can be set with schedule:
• Variables are supported via ${variable_name}
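
To make the ${variable_name} syntax concrete, here is a minimal Python sketch of simple placeholder substitution. It is only an illustration of the idea, not how Treasure Workflow resolves variables internally; the query text and the variable name table_name are hypothetical.

```python
from string import Template

# Hypothetical query text containing a ${...} placeholder.
query_template = Template("SELECT * FROM ${table_name} LIMIT 10")

# Hypothetical variable values, standing in for what a workflow might export.
variables = {"table_name": "monthly_goods_sales"}

# substitute() raises KeyError if a placeholder has no matching variable.
query = query_template.substitute(variables)
print(query)
```

Python's string.Template happens to use the same ${name} notation, which is why it is used here; the workflow engine may support richer expressions than this sketch does.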

REPRESENTATIVE OPERATORS
Category               Name               Description
Control Flow           call>:             Call another workflow
                       loop>:             Repeat tasks a specified # of times
                       for_each>:         Loop through a specified list
                       if>:               if/else control flow
Treasure Data          td>:               Run a specified TD query
                       td_run>:           Run a saved query
                       td_ddl>:           Create, delete, rename, truncate tables
                       td_load>:          Invoke an input data transfer
                       td_for_each>:      Loop through a query result row by row
AWS                    s3_wait>:          Wait for new files in S3 & download
                       redshift>:         Run Redshift query
                       redshift_load>:    Load data into Redshift
                       redshift_unload>:  Unload data from Redshift
Google Cloud Platform  bq>:               Run BigQuery query
                       bq_extract>:       Unload data from BigQuery to GCS

Slowly Changing Dimensions
• Particular dimensions within a dataset that are prone to change unpredictably
• Example: the phone number or email field of a CRM dataset
• Data available from a CRM usually represents the current, up-to-date value of each field for each customer
• Storing a history of this customer data requires managing these slowly changing dimensions (SCDs)

Different Ways to Handle SCDs
• Type 1
• Type 2
• Type 3
• Type 4

Type 1: Overwrite the field
company_id  company_name     company_state
Old Record:
123         Sterling Cooper  New York
New Record:
123         Sterling Cooper  California
SCD Type 1:
123         Sterling Cooper  California
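
A minimal Python sketch of the Type 1 behavior, using an in-memory dictionary as a stand-in for the company table. The helper name apply_type1 is hypothetical and only illustrates the overwrite.

```python
# In-memory stand-in for the company table, keyed by company_id.
company = {
    123: {"company_name": "Sterling Cooper", "company_state": "New York"},
}

def apply_type1(table, company_id, new_values):
    """Overwrite the stored fields in place; the old values are lost."""
    table[company_id] = {**table.get(company_id, {}), **new_values}

apply_type1(company, 123, {"company_state": "California"})
print(company[123])
```

After the update there is still one row per company, and nothing records that the state was ever New York.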

Type 2: Keep both records, flag the “current” row
SCD Type 2:
company_id  company_name     company_state  is_current
123         Sterling Cooper  New York       0
123         Sterling Cooper  California     1

Type 3: Store the latest two values in one row
SCD Type 3:
company_id  company_name     company_state_current  company_state_previous
123         Sterling Cooper  California             New York
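
The Type 3 update can be sketched in Python as a shift of the current value into the "previous" column. The helper name apply_type3 is hypothetical, and the starting row assumes no earlier value was known.

```python
def apply_type3(row, new_state):
    """Move the current state into the 'previous' column, then store the new state."""
    return {
        **row,
        "company_state_previous": row["company_state_current"],
        "company_state_current": new_state,
    }

row = {
    "company_id": 123,
    "company_name": "Sterling Cooper",
    "company_state_current": "New York",
    "company_state_previous": None,  # assumption: no earlier value is known
}

row = apply_type3(row, "California")
print(row)
```

A second change would push California into the previous column and drop New York, which is why Type 3 only ever holds the latest two values.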

Type 4: Use a separate history table
SCD Type 4:
company
company_id company_name company_state last_modified_date
123 Sterling Cooper New York 2007-06-19
123 Sterling Cooper California 2008-10-12
company_history
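
A minimal Python sketch of one common Type 4 layout. It assumes that the company table keeps only the latest values while company_history accumulates one row per version with its last_modified_date; the slide does not spell out which rows live in which table, so treat that split as an assumption. The helper name apply_type4 is hypothetical.

```python
company = {}          # current values only, keyed by company_id
company_history = []  # one row per version, oldest first

def apply_type4(company_id, name, state, last_modified_date):
    """Overwrite the current table and append the version to the history table."""
    company[company_id] = {"company_name": name, "company_state": state}
    company_history.append({
        "company_id": company_id,
        "company_name": name,
        "company_state": state,
        "last_modified_date": last_modified_date,
    })

apply_type4(123, "Sterling Cooper", "New York", "2007-06-19")
apply_type4(123, "Sterling Cooper", "California", "2008-10-12")

print(company[123])
print([h["company_state"] for h in company_history])
```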

Type 2: Keep both records, flag the “current” row
SCD Type 2:
company_id  company_name     company_state  lastmodifieddate  is_current
123         Sterling Cooper  New York       2007-06-19        0
123         Sterling Cooper  California     2008-10-12        1
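
A minimal Python sketch of the Type 2 behavior on an in-memory list of rows. The helper name apply_type2 is hypothetical, and the sketch assumes the new record is always newer than the rows already stored.

```python
def apply_type2(rows, new_row):
    """Keep every version; only the newly added row is flagged as current."""
    for row in rows:
        if row["company_id"] == new_row["company_id"]:
            row["is_current"] = 0  # demote the previously current version
    rows.append({**new_row, "is_current": 1})

rows = []
apply_type2(rows, {"company_id": 123, "company_name": "Sterling Cooper",
                   "company_state": "New York", "lastmodifieddate": "2007-06-19"})
apply_type2(rows, {"company_id": 123, "company_name": "Sterling Cooper",
                   "company_state": "California", "lastmodifieddate": "2008-10-12"})

for row in rows:
    print(row["company_state"], row["lastmodifieddate"], row["is_current"])
```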

Window Functions
• Window functions perform calculations across rows of the query result
• They run after the ‘HAVING’ clause but before the ‘ORDER BY’ clause
• They are written in the ‘SELECT’ clause and display results in their own
column
• They have three parts:

Window Functions
rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC)
  function:                 rank()
  partition specification:  PARTITION BY company_id
  ordering specification:   ORDER BY lastmodifieddate DESC

Window Functions
SELECT
  company_id,
  company_name,
  company_state,
  rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC) AS isCurrent
FROM company

Result:
124  CGC  Connecticut  2018-05-22  1
124  CGC  New York     2010-08-22  2
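
The ranking query can be tried locally with Python's built-in sqlite3 module. This is a sketch under the assumption that the bundled SQLite is version 3.25 or later (the first release with window functions); it is not Treasure Data's query engine. The sample rows are the two company 124 records from the slides, and lastmodifieddate is added to the SELECT list so the output lines up with the result shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company ("
    "company_id INTEGER, company_name TEXT, company_state TEXT, lastmodifieddate TEXT)"
)
conn.executemany(
    "INSERT INTO company VALUES (?, ?, ?, ?)",
    [
        (124, "CGC", "New York", "2010-08-22"),
        (124, "CGC", "Connecticut", "2018-05-22"),
    ],
)

# ISO-formatted date strings sort correctly as text, so ORDER BY works on them.
ranked = conn.execute("""
    SELECT company_id, company_name, company_state, lastmodifieddate,
           rank() OVER (PARTITION BY company_id
                        ORDER BY lastmodifieddate DESC) AS isCurrent
    FROM company
    ORDER BY company_id, isCurrent
""").fetchall()

for row in ranked:
    print(row)
```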

Window Functions
SELECT
company_id,
company_name,
company_state,
rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate ASC) AS isCurrent
FROM company
124 CGC New York 2010-08-22 1

Window Functions
SELECT
company_id,
company_name,
company_state,
rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate ASC) AS isCurrent
FROM company
124 CGC New York 2010-08-22 0

Window Functions
SELECT
  company_id,
  company_name,
  company_state,
  CASE WHEN rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC) = 1 THEN 1 ELSE 0 END AS isCurrent
FROM company

Result:
124  CGC  New York  2010-08-22  0
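
A sketch of the flag query run against sample data with Python's sqlite3 module, again assuming SQLite 3.25 or later rather than Treasure Data's engine. The rows combine the company 123 and 124 examples from earlier slides. The rank is ordered newest first, so that rank 1 marks the latest row per company.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company ("
    "company_id INTEGER, company_name TEXT, company_state TEXT, lastmodifieddate TEXT)"
)
conn.executemany(
    "INSERT INTO company VALUES (?, ?, ?, ?)",
    [
        (123, "Sterling Cooper", "New York", "2007-06-19"),
        (123, "Sterling Cooper", "California", "2008-10-12"),
        (124, "CGC", "New York", "2010-08-22"),
        (124, "CGC", "Connecticut", "2018-05-22"),
    ],
)

# rank() = 1 identifies the newest row in each company_id partition;
# CASE turns that into a 1/0 flag.
flagged = conn.execute("""
    SELECT company_id, company_state, lastmodifieddate,
           CASE WHEN rank() OVER (PARTITION BY company_id
                                  ORDER BY lastmodifieddate DESC) = 1
                THEN 1 ELSE 0 END AS isCurrent
    FROM company
    ORDER BY company_id, lastmodifieddate
""").fetchall()

for row in flagged:
    print(row)
```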

Implementation in Treasure Data
1. Load incremental data from a data source to a staging table
2. Drop the target table that contains outdated SCD information
3. Window over the staging table, rebuilding the target table with the latest SCD information

staging_company:
124  CGC  New York  2010-08-22
target_company:
124  CGC  New York  2010-08-22  1

staging_company:
124  CGC  New York     2010-08-22
124  CGC  Connecticut  2018-05-22
target_company:
124  CGC  New York  2010-08-22  1

staging_company:
124  CGC  New York  2010-08-22
target_company: (no rows shown)

staging_company:
124  CGC  New York  2010-08-22
target_company:
124  CGC  New York  2010-08-22  0
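
The three implementation steps can be sketched end to end with Python's sqlite3 module (assuming SQLite 3.25 or later). This is a local illustration only: in a real workflow the steps would be tasks running Treasure Data jobs, and the helper names load_increment, rebuild_target, and read_target are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_company ("
    "company_id INTEGER, company_name TEXT, company_state TEXT, lastmodifieddate TEXT)"
)

def load_increment(rows):
    """Step 1: append incremental rows to the staging table."""
    conn.executemany("INSERT INTO staging_company VALUES (?, ?, ?, ?)", rows)

def rebuild_target():
    """Steps 2 and 3: drop the outdated target, then rebuild it with a window function."""
    conn.execute("DROP TABLE IF EXISTS target_company")
    conn.execute("""
        CREATE TABLE target_company AS
        SELECT company_id, company_name, company_state, lastmodifieddate,
               CASE WHEN rank() OVER (PARTITION BY company_id
                                      ORDER BY lastmodifieddate DESC) = 1
                    THEN 1 ELSE 0 END AS is_current
        FROM staging_company
    """)

def read_target():
    return conn.execute(
        "SELECT company_state, is_current FROM target_company ORDER BY lastmodifieddate"
    ).fetchall()

# First run: only the New York record exists, so it is current.
load_increment([(124, "CGC", "New York", "2010-08-22")])
rebuild_target()
first_run = read_target()

# Second run: the Connecticut record arrives and New York is demoted.
load_increment([(124, "CGC", "Connecticut", "2018-05-22")])
rebuild_target()
second_run = read_target()

print(first_run)
print(second_run)
```

Because the target is rebuilt from the full staging table on every run, the staging table has to retain every version ever loaded.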

SCD Type 2 Workflow with Persistent Architecture
staging_company: (no rows shown)
target_company:
124  CGC  New York  2010-08-22  1
1. Store a temp table (tmp_no_longer_current) of the current rows that will not be current after the new data is ingested
2. Delete from the data lake any current rows that have a matching id in the new data
3. Insert the temp rows into the target table
4. Insert the new data into the target table
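
The four steps above can be sketched with Python's sqlite3 module. Several details are assumptions rather than something the slides state: the temp rows are stored with is_current already set to 0, the staging table holds at most one new row per company_id, and the temp table is dropped at the end. The Connecticut row and the untouched company 123 row are sample data taken from earlier slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target_company (
        company_id INTEGER, company_name TEXT, company_state TEXT,
        lastmodifieddate TEXT, is_current INTEGER);
    CREATE TABLE staging_company (
        company_id INTEGER, company_name TEXT, company_state TEXT,
        lastmodifieddate TEXT);

    INSERT INTO target_company VALUES
        (123, 'Sterling Cooper', 'California', '2008-10-12', 1),
        (124, 'CGC', 'New York', '2010-08-22', 1);
    INSERT INTO staging_company VALUES
        (124, 'CGC', 'Connecticut', '2018-05-22');
""")

# 1. Store the current rows that are about to be superseded (flag set to 0, an assumption).
conn.execute("""
    CREATE TABLE tmp_no_longer_current AS
    SELECT company_id, company_name, company_state, lastmodifieddate, 0 AS is_current
    FROM target_company
    WHERE is_current = 1
      AND company_id IN (SELECT company_id FROM staging_company)
""")

# 2. Delete current rows that have a matching id in the new data.
conn.execute("""
    DELETE FROM target_company
    WHERE is_current = 1
      AND company_id IN (SELECT company_id FROM staging_company)
""")

# 3. Insert the demoted rows back into the target table.
conn.execute("INSERT INTO target_company SELECT * FROM tmp_no_longer_current")

# 4. Insert the new data, flagged as current.
conn.execute("""
    INSERT INTO target_company
    SELECT company_id, company_name, company_state, lastmodifieddate, 1
    FROM staging_company
""")

conn.execute("DROP TABLE tmp_no_longer_current")

final = conn.execute("""
    SELECT company_id, company_state, is_current
    FROM target_company
    ORDER BY company_id, lastmodifieddate
""").fetchall()
print(final)
```

Unlike the rebuild approach, rows for companies with no new data (company 123 here) are never touched, so the staging table only needs to hold the latest increment.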

Contact Information
• Scott Mitchell
• Senior Solution Engineer
• smitchell@treasure-data.com
