This session covers building a modern data warehouse by migrating a traditional DW platform to the cloud, using Amazon Redshift and the cloud ETL tool Matillion, in order to provide self-service BI for a business audience. It covers the technical migration path from a DW with PL/SQL ETL to Amazon Redshift via Matillion ETL, with a detailed comparison of modern ETL tools. It also focuses on working backward through the process, i.e. starting from the business audience and the needs that drive changes in the old DW. Finally, it covers the idea of self-service BI, and the author shares a step-by-step plan for building an efficient self-service environment using the modern BI platform Tableau.
3. Outline
• About Myself
• About Abebooks
• Choosing ETL for the Cloud
• Data Acquisition Patterns with Matillion ETL
• Setting Up Self-Service BI
• Lessons Learned during the journey to the Cloud
4. About Myself
• Working in BI since 2007
• Implemented BI in Russia, Europe, and Canada
8. About Abebooks
• Online marketplace for books, art & collectibles.
• An Amazon subsidiary since 2008; a marketplace for used books and, increasingly, non-book collectibles
• 350 million listings
• A 'DB Team' of 3
• 2 locations: Victoria, BC and Düsseldorf
10. Abebooks Data Flows
• Built by DBAs - db links, PL/SQL, external tables, shell scripts
• Even before 2015, Redshift was a strategic target, but an ETL re-write was too expensive
[Diagram: source layer (SALES, INVENTORY, CS, SFTP) → ETL (PL/SQL) → DW storage layer → access layer (ad-hoc SQL)]
11. Choosing ETL Tool for Cloud
Use Cases
• OLTP to S3
• S3 to Redshift
• SFTP/API to Redshift
• Data Transformation
• Dimensional Modelling
Tools
• Pentaho DI
• Informatica
• AWS Data Pipeline
• Talend
• Matillion
12. ETL Criteria
High:
• Native Redshift driver support
• Easy capture from relational DBs, CDC
• Ease of use for BI/DW
• Covers our use cases
• On-premise deployment
Medium:
• NoSQL support
• Company "winner" (will the vendor last?)
• Deployment/architecture
• Encryption
• Ease of use for non-BI/DW users
• Data transformations
• Management
• Pricing
• Performance
Low:
• Version control
• Linux OS
• ETL monitoring
• Logging
• R/Python
13. Why We Picked Matillion
• Specific Redshift support: built around the Redshift platform
• Speed of ETL operations
• Speed of development
• Wide range of supported data sources
• Ease of use outside of DE/DBA expertise
• Native to AWS
• $$$
• The biggest risk: putting our eggs in the Matillion basket, betting on a small and new player
15. Abebooks Cloud Analytics Architecture
[Diagram: Abebooks cloud analytics architecture (Abebooks DW account). Source systems (DynamoDB, Amazon RDS, external APIs, SFTP, apps) are loaded by Matillion ELT on EC2 (m4.large: 2 vCPU, 8 GB RAM) into an S3 data lake behind Amazon Elastic Load Balancing. Query/compute: Amazon Redshift, Redshift Spectrum, Amazon Athena, Amazon EMR. Event/notification services: SQS, SNS, Amazon Chime. End-user access: Tableau Server, Tableau Web, Tableau Desktop, and ad-hoc SQL.]
16. Pattern 1: getting data via SFTP
• Scan SFTP, get all file names, load them into Redshift
• Identify only the new files
• Load one ${file_name} at a time (using IF we can choose the right stream)
• Insert the processed ${file_name} into Redshift
• Load the next file
Takeaways:
• Python BOTO library for managing S3
• Matillion variables ${variable}
• Using Matillion Iterators
• Execute SQL via Python
• If file is missing, try again later
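The "identify only new files" step above can be sketched in a few lines. This is a minimal illustration, not the actual Matillion job: `identify_new_files` is a hypothetical helper, and in the real pipeline the processed list would come from the Redshift tracking table and the loop would be driven by a Matillion iterator over `${file_name}`.

```python
def identify_new_files(sftp_listing, processed_files):
    """Return the SFTP file names not yet loaded into Redshift, in order."""
    return sorted(set(sftp_listing) - set(processed_files))

# Example: two files already processed, one new file to load next.
listing = ["sales_2019_01.csv", "sales_2019_02.csv", "sales_2019_03.csv"]
processed = ["sales_2019_01.csv", "sales_2019_02.csv"]
print(identify_new_files(listing, processed))  # ['sales_2019_03.csv']
```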
17. Pattern 2: getting data via API
• Connect to the API via a Python script
• Get data via API calls and save it to CSV on EC2
• Upload the CSV into S3
• Load the CSV into Redshift
Takeaways:
• Using Python to connect to an external API
• Using AWS KMS to encrypt credentials
• Using SNS for email notification
• Using Matillion system variables for ETL logs
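The "save to CSV on EC2" step of Pattern 2 can be sketched as below. The rows and file name are made up; the API call, KMS decryption, and the S3 upload plus Redshift COPY that follow are omitted and only indicated in comments.

```python
import csv
import os
import tempfile

def rows_to_csv(rows, path):
    """Write a list of dicts (API results) to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

rows = [{"order_id": 1, "total": 9.99}, {"order_id": 2, "total": 4.50}]
path = os.path.join(tempfile.gettempdir(), "orders.csv")
rows_to_csv(rows, path)
# Next steps (omitted): upload the file to S3 with boto, then run a
# Redshift COPY from the s3:// location.
```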
18. Pattern 3: getting data from DynamoDB
Takeaways:
• Using the DynamoDB component (it generates the COPY command for you)
• You can't easily get incremental changes, i.e. it is a full reload
• Speed depends on two things, the "read ratio" and the per-table "read capacity". The actual rows-per-hour value is based on readRatio * tableReadCapacity.
• 51M rows with 35% read ratio and 300 read capacity = 9 hours
• 211M rows with 66% read ratio and 1,500 read capacity = 4 hours
• Reloading once a week
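The throughput rule above can be turned into a rough back-of-the-envelope estimator. Note the `rows_per_read` factor is my assumption, not from the slide: the observed timings (51M rows in 9 hours) imply considerably more than one row per consumed capacity unit, so treat this purely as a proportional model.

```python
def load_hours(rows, read_ratio, read_capacity, rows_per_read=1):
    """Rough DynamoDB full-reload estimate: COPY throughput is assumed
    proportional to read_ratio * read_capacity (per second)."""
    rows_per_second = read_ratio * read_capacity * rows_per_read
    return rows / rows_per_second / 3600

# At 35% read ratio and 300 read capacity, 378,000 "reads" take one hour.
print(round(load_hours(378_000, 0.35, 300), 2))  # 1.0
```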
19. Pattern 4: getting data from external S3*
Getting data from another VPC: change the bucket's policy and it will appear in the list of buckets in Matillion.
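A cross-account bucket policy of the kind Pattern 4 relies on might look like the sketch below. The account ID and bucket name are placeholders, and the exact actions granted would depend on the setup; the bucket owner applies this policy in the other account.

```python
import json

def cross_account_read_policy(bucket, reader_account_id):
    """Build an S3 bucket policy granting another AWS account list/read
    access, so the bucket becomes visible to Matillion in that account."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{reader_account_id}:root"},
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }

print(json.dumps(cross_account_read_policy("partner-exports", "123456789012"),
                 indent=2))
```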
23. BI Survey
• ETL was a black box
• A lack of notifications
• A lack of documentation and training
• A lack of automation
• No dependency between reports and ETL process
• High dependency on the BI/DW team
24. BI Champions
The BI champion is the sheriff, ensuring the townspeople (the business users) stay productive and can do analytics quickly and smoothly. The BI champion is meant to be both an evangelist and a subject-matter expert for BI within the organization. The champion should be well versed in the data important to their team, and knowledgeable in the core BI technologies and patterns used within AbeBooks.
25. ETL Monitor and notifications
An SNS topic sends the email; in addition, we can add any number of Matillion variables to the message.
Using an Amazon Chime webhook, we can execute a curl command via a bash script and send a message to the business users.
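The Chime notification can be sketched as below. A Chime incoming webhook accepts a JSON body with a `Content` field; the job name and message wording here are hypothetical, and the actual POST (via curl in the deck's bash script, or urllib in Python) is only indicated in a comment.

```python
def chime_message(job, status, detail=""):
    """Build the JSON payload to POST to an Amazon Chime incoming webhook."""
    text = f"ETL job {job} finished with status {status}. {detail}".strip()
    return {"Content": text}

payload = chime_message("daily_sales", "FAILED", "See the ETL monitor.")
print(payload)
# Sending (omitted): POST json.dumps(payload) to the webhook URL, e.g.
# https://hooks.chime.aws/incomingwebhooks/<id>?token=<token>
```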
26. ETL Monitor
Using Matillion system variables we track all events, visualize them via Tableau for end users, and create alerts in case of failure.
27. ETL Trigger for Tableau
Task: refresh Tableau data sources (the semantic layer) and workbooks when the FACT tables are refreshed.
Solution: deploy the Tableau CLI tool (tabcmd) on the Matillion EC2 instance and run it via a bash script.
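The trigger can be sketched as below: tabcmd requires a login before `refreshextracts`, so the helper builds both argument lists. The server URL, site, user, and data-source name are placeholders; the real job would run these from Matillion's bash component after the FACT loads finish.

```python
def tabcmd_refresh_cmds(datasource, server, site, user, password_file):
    """Build the two tabcmd invocations: login, then refresh one data source."""
    return [
        ["tabcmd", "login", "-s", server, "-t", site,
         "-u", user, "--password-file", password_file],
        ["tabcmd", "refreshextracts", "--datasource", datasource],
    ]

cmds = tabcmd_refresh_cmds("Sales Semantic Layer",
                           "https://tableau.example.com", "abebooks",
                           "etl_user", "/etc/tableau/pw")
# Running (omitted): for cmd in cmds: subprocess.run(cmd, check=True)
```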
28. Self-Service BI
• Change Management: from report-writing culture to data-driven company
• Clear authority: support of the executive team
• The analytic culture: Business executives must have a vision for analytics and the willingness to invest in the
people, processes, and technologies for the long haul to ensure a successful outcome.
• The right people (data engineers, BI engineers, business analysts)
• The right organizational structure: BI Center of Excellence, that establishes and inculcates best practices for
building analytical applications
• The right data and architecture
• The right tools: Redshift, Matillion and Tableau are best for Self-Serve
29. Report Automation
Before (TL;DR: Ctrl+C, Ctrl+V, IT dependency):
• Lots of SQL and Excel routine
• Each team defines its own report style and format
• Multiple definitions of the same metrics
• No visualization, no alerts
• Slow data discovery and hypothesis evaluation
After:
• Central BI portal
• Reusable Tableau data sources, a.k.a. the business layer
• Common WBR format
• Eliminated manual work
• No spreadsheets or ad-hoc SQL queries
• Data discovery
• ETL integration
• Friendly drag-and-drop GUI
31. Five Points of Guidance for Redshift (SET DW)
1. Sort Keys:
• Choose up to 3 columns
• Ordered in increasing order of specificity, balanced with likelihood of use.
• Leave INTERLEAVED sort keys for 1 year anniversary.
2. Column Encoding:
• Compress all columns except for (at least) the first sort key.
3. Table Maintenance:
• VACUUM and ANALYZE tables weekly (use STL_ALERT_EVENT_LOG as a guide for frequency).
• ANALYZE PREDICATE COLUMNS is very useful for quick daily stats refresh.
4. Choose a Distribution Key that:
• Follows the common join pattern for the table.
• Evenly distributes the data across the database slices on the cluster.
• DISTSTYLE ALL is a great go-to for dimension tables < ~3 million rows.
• DISTSTYLE EVEN is a good fail-safe, but guarantees inter-node data redistribution.
5. Workload Management (WLM) and Query Monitoring Rules (QMR):
• Start with up to 3 queues (in addition to what Redshift provides automatically).
• Put ETL in its own queue with very low active_statement count (perhaps as low as 1 or 2). Monitor commit queuing.
• Split up the memory across the queues. Monitor the percent of each queue’s workload going to disk.
• Expect to change WLM settings to match the workload changes (day|night, weekday|weekend)
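Points 1, 2, and 4 above can be illustrated with a toy DDL: up to three sort keys in increasing specificity, explicit encodings with the first sort key left uncompressed (RAW), and a distribution key following the join pattern. The table and column names are made up for the example.

```python
def fact_table_ddl():
    """Return example Redshift DDL following the sort key, encoding,
    and distribution key guidance."""
    return """
CREATE TABLE fact_sales (
    sale_date   DATE          ENCODE RAW,   -- 1st sort key: left uncompressed
    customer_id BIGINT        ENCODE ZSTD,  -- dist key: common join column
    listing_id  BIGINT        ENCODE ZSTD,
    amount      NUMERIC(12,2) ENCODE ZSTD
)
DISTKEY (customer_id)
SORTKEY (sale_date, customer_id, listing_id);  -- 3 cols, increasing specificity
""".strip()

print(fact_table_ddl())
```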
32. Lesson One. CHOOSE RIGHT MIGRATION STRATEGY
Lift & Shift
• Typical Approach
• Move all-at-once
• Move to the target platform, then evolve
• This approach gets you to the cloud quickly
• Relatively small barrier to learning the new technology, since it tends to be a close fit
33. Lesson One. CHOOSE RIGHT MIGRATION STRATEGY
Split & Flip
• Split the application into logical functional data layers
• Match the data functionality with the right technology
• Leverage the wide selection of tools on AWS to best fit the need
• Move data in phases: prototype, learn and perfect
34. Lesson Two. CHANGE YOUR MINDSET
Take the time to learn
• It is critical to train on and learn the new technologies being used
• It is easy to think only about translating or converting
• We made many such changes: relational vs non-relational, batch vs streaming, service-based vs procedural, etc.
35. Lesson Two. CHANGE YOUR MINDSET
Traditional DW — faster runtime is better
Cloud — if runtime is slower, it is easy to scale
Reality:
• Query #1 uses 64 cores and runs in 1 min; Query #2 uses 1 core and runs in 2 mins
• The practical limitation to scaling is a fixed budget
36. Lesson Two. CHANGE YOUR MINDSET
We optimized for cost in Redshift
• What is the most work that can be done with a given fixed budget?
• The focus is on the total amount of work, versus optimizing for a single user
• Everything you use comes at a cost in the cloud
• Examples: DynamoDB performance; Redshift vs Spectrum (S3)
Cost is just one example of the many mindset changes we made.
37. Lesson Three. DO NOT BE SCARED TO OPEN THE BLACK BOX
• All business logic is hidden in legacy ETL scripts
• There is a tradeoff between a fast project and business users' expectations
• Learn about your business
• Discover and fix the issues
38. Lesson Four. BE AGILE AND INVOLVE BUSINESS
Agile Benefits
• See results earlier
• Constant feedback
• Serves your users
• Flexibility
• Quality Assurance
39. Lesson Five. PLAN YOUR EVOLUTION
Handling Less Efficient Queries
• Provide a separate cluster as a sandbox
• App developers design new queries that fit the constraints of hands-off operations
Example: create roll-up summary tables in Redshift.
Notes on the ETL criteria:
• Company "winner": will this tool be supported and fully usable in 3-5 years? Will it be adopted by Amazon, and will there be a community of use? Are there recommendations within Amazon (such as from AWS SAs)? Years in business, customers, profitability.
• Management: scheduling built in; intuitive views of DW processes, models, and schedules; does it help someone understand DW data flows?
• Deployment/architecture: AWS better than local; Linux better than Windows; must be a patchable platform within Amazon guidelines.
• The biggest risk was the investment in a tool from a small player; porting ETL processes from Matillion would be no less expensive than porting from PL/SQL and db links.