This session covers building a modern data warehouse by migrating a traditional DW platform to the cloud, using Amazon Redshift and the cloud ETL tool Matillion, in order to provide self-service BI for a business audience. It covers the technical migration path from a DW with PL/SQL ETL to Amazon Redshift via Matillion ETL, with a detailed comparison of modern ETL tools. It also focuses on working backward through the process, i.e. starting from the business audience and the needs that drive changes in the old DW. Finally, it covers the idea of self-service BI, and the author shares a step-by-step plan for building an efficient self-service environment using the modern BI platform Tableau.
3. Outline
• About Myself
• About Abebooks
• Choosing ETL for the Cloud
• Data Acquisition Patterns with Matillion ETL
• Setting Up Self-Service BI
• Lessons Learned during the journey to the Cloud
4. About Myself
• Working in BI since 2007
• Implemented BI in Russia, Europe, and Canada
8. About Abebooks
• Online marketplace for books, art & collectibles.
• An Amazon subsidiary since 2008; a marketplace for used books and, increasingly, non-book collectibles
• 350 million listings
• A 'DB Team' of 3
• 2 locations: Victoria, BC and Düsseldorf
10. Abebooks Data Flows
• Built by DBAs - db links, PL/SQL, external tables, shell scripts
• Even before 2015, Redshift was a strategic target, but an ETL re-write was too expensive
[Diagram: source layer (SALES, INVENTORY, CS, SFTP) → ETL (PL/SQL) → DW storage layer → access layer (ad-hoc SQL)]
11. Choosing ETL Tool for Cloud
Use Cases
• OLTP to S3
• S3 to Redshift
• SFTP/API to Redshift
• Data Transformation
• Dimensional Modelling
Tools
• Pentaho DI
• Informatica
• AWS Data Pipeline
• Talend
• Matillion
12. ETL Criteria
High:
• Native Redshift driver support
• Easy capture from relational DBs, CDC
• Ease of use for BI/DW
• Covers our use cases
• On-premise deployment
Medium:
• NoSQL support
• Company "winner" (will the vendor last?)
• Deployment/architecture
• Encryption
• Ease of use for non-BI/DW users
• Data transformations
• Management
• Pricing
• Performance
Low:
• Version control
• Linux OS
• ETL monitoring
• Logging
• R/Python
13. Why We Picked Matillion
• Specific Redshift support: built around the Redshift platform
• Speed of ETL operations
• Speed of development
• Wide range of supported data sources
• Ease of use outside of DE/DBA expertise
• Native to AWS
• $$$
• The biggest risk: putting our eggs in the Matillion basket, betting on a small and new player
15. Abebooks Cloud Analytics Architecture
[Diagram: Abebooks cloud analytics architecture (Abebooks DW account). Source systems (DynamoDB, Amazon RDS, external APIs, SFTP, apps) are loaded by Matillion ELT on EC2 (m4.large: 2 vCPU, 8 GB RAM) into an S3 data lake behind Amazon Elastic Load Balancing. Query/compute: Amazon Redshift, Redshift Spectrum, Amazon Athena, Amazon EMR. Event/notification services: SQS, SNS, Amazon Chime. End-user access: Tableau Server, Tableau Web, Tableau Desktop, and ad-hoc SQL.]
16. Pattern 1: getting data via SFTP
• Scan SFTP, get all file names, load them into Redshift
• Identify only the new files
• Load one ${file_name} at a time (using IF we can choose the right stream)
• Insert the processed ${file_name} into Redshift
• Load the next file
Takeaways:
• Python BOTO library for managing S3
• Matillion variables ${variable}
• Using Matillion Iterators
• Execute SQL via Python
• If file is missing, try again later
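The "identify only new files" step above can be sketched in a few lines. This is a minimal illustration, not the actual Matillion job: `identify_new_files` is a hypothetical helper, and in the real pipeline the processed list would come from the Redshift tracking table and the loop would be driven by a Matillion iterator over `${file_name}`.

```python
def identify_new_files(sftp_listing, processed_files):
    """Return the SFTP file names not yet loaded into Redshift, in order."""
    return sorted(set(sftp_listing) - set(processed_files))

# Example: two files already processed, one new file to load next.
listing = ["sales_2019_01.csv", "sales_2019_02.csv", "sales_2019_03.csv"]
processed = ["sales_2019_01.csv", "sales_2019_02.csv"]
print(identify_new_files(listing, processed))  # ['sales_2019_03.csv']
```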
17. Pattern 2: getting data via API
• Connect to the API via a Python script
• Get data via API calls and save it to CSV on EC2
• Upload the CSV into S3
• Load the CSV into Redshift
Takeaways:
• Using Python to connect to an external API
• Using AWS KMS to encrypt credentials
• Using SNS for email notification
• Using Matillion system variables for ETL logs
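The "save to CSV on EC2" step of Pattern 2 can be sketched as below. The rows and file name are made up; the API call, KMS decryption, and the S3 upload plus Redshift COPY that follow are omitted and only indicated in comments.

```python
import csv
import os
import tempfile

def rows_to_csv(rows, path):
    """Write a list of dicts (API results) to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

rows = [{"order_id": 1, "total": 9.99}, {"order_id": 2, "total": 4.50}]
path = os.path.join(tempfile.gettempdir(), "orders.csv")
rows_to_csv(rows, path)
# Next steps (omitted): upload the file to S3 with boto, then run a
# Redshift COPY from the s3:// location.
```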
18. Pattern 3: getting data from DynamoDB
Takeaways:
• Using the DynamoDB component (it generates the COPY command for you)
• You can't easily get incremental changes, i.e. it is a full reload
• Speed depends on two things, the "read ratio" and the per-table "read capacity". The actual rows-per-hour value is based on readRatio * tableReadCapacity.
• 51M rows with 35% read ratio and 300 read capacity = 9 hours
• 211M rows with 66% read ratio and 1,500 read capacity = 4 hours
• Reloading once a week
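The throughput rule above can be turned into a rough back-of-the-envelope estimator. Note the `rows_per_read` factor is my assumption, not from the slide: the observed timings (51M rows in 9 hours) imply considerably more than one row per consumed capacity unit, so treat this purely as a proportional model.

```python
def load_hours(rows, read_ratio, read_capacity, rows_per_read=1):
    """Rough DynamoDB full-reload estimate: COPY throughput is assumed
    proportional to read_ratio * read_capacity (per second)."""
    rows_per_second = read_ratio * read_capacity * rows_per_read
    return rows / rows_per_second / 3600

# At 35% read ratio and 300 read capacity, 378,000 "reads" take one hour.
print(round(load_hours(378_000, 0.35, 300), 2))  # 1.0
```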
19. Pattern 4: getting data from external S3*
Getting data from another VPC: change the bucket's policy and it will appear in the list of buckets in Matillion.
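A cross-account bucket policy of the kind Pattern 4 relies on might look like the sketch below. The account ID and bucket name are placeholders, and the exact actions granted would depend on the setup; the bucket owner applies this policy in the other account.

```python
import json

def cross_account_read_policy(bucket, reader_account_id):
    """Build an S3 bucket policy granting another AWS account list/read
    access, so the bucket becomes visible to Matillion in that account."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{reader_account_id}:root"},
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
        }],
    }

print(json.dumps(cross_account_read_policy("partner-exports", "123456789012"),
                 indent=2))
```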
23. BI Survey
• ETL was a black box
• A lack of notifications
• A lack of documentation and training
• A lack of automation
• No dependency between reports and ETL process
• High dependency on the BI/DW team
24. BI Champions
The BI champion is the sheriff, ensuring the townspeople (the business users) stay productive and can do analytics quickly and smoothly. The BI champion is meant to be both an evangelist and a subject-matter expert for BI within the organization. The champion should be well versed in the data important to their team, and knowledgeable in the core BI technologies and patterns used within AbeBooks.
25. ETL Monitor and notifications
An SNS topic sends the email; in addition, we can add any number of Matillion variables to the message.
Using an Amazon Chime webhook, we can execute a curl command via a bash script and send a message to the business users.
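The Chime notification can be sketched as below. A Chime incoming webhook accepts a JSON body with a `Content` field; the job name and message wording here are hypothetical, and the actual POST (via curl in the deck's bash script, or urllib in Python) is only indicated in a comment.

```python
def chime_message(job, status, detail=""):
    """Build the JSON payload to POST to an Amazon Chime incoming webhook."""
    text = f"ETL job {job} finished with status {status}. {detail}".strip()
    return {"Content": text}

payload = chime_message("daily_sales", "FAILED", "See the ETL monitor.")
print(payload)
# Sending (omitted): POST json.dumps(payload) to the webhook URL, e.g.
# https://hooks.chime.aws/incomingwebhooks/<id>?token=<token>
```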
26. ETL Monitor
Using Matillion system variables we track all events, visualize them via Tableau for end users, and create alerts in case of failure.
27. ETL Trigger for Tableau
Task: refresh Tableau data sources (the semantic layer) and workbooks when the FACT tables are refreshed.
Solution: deploy the Tableau CLI tool (tabcmd) on the Matillion EC2 instance and run it via a bash script.
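The trigger can be sketched as below: tabcmd requires a login before `refreshextracts`, so the helper builds both argument lists. The server URL, site, user, and data-source name are placeholders; the real job would run these from Matillion's bash component after the FACT loads finish.

```python
def tabcmd_refresh_cmds(datasource, server, site, user, password_file):
    """Build the two tabcmd invocations: login, then refresh one data source."""
    return [
        ["tabcmd", "login", "-s", server, "-t", site,
         "-u", user, "--password-file", password_file],
        ["tabcmd", "refreshextracts", "--datasource", datasource],
    ]

cmds = tabcmd_refresh_cmds("Sales Semantic Layer",
                           "https://tableau.example.com", "abebooks",
                           "etl_user", "/etc/tableau/pw")
# Running (omitted): for cmd in cmds: subprocess.run(cmd, check=True)
```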
28. Self-Service BI
• Change Management: from report-writing culture to data-driven company
• Clear authority: support of the executive team
• The analytic culture: Business executives must have a vision for analytics and the willingness to invest in the
people, processes, and technologies for the long haul to ensure a successful outcome.
• The right people (data engineers, BI engineers, business analysts)
• The right organizational structure: BI Center of Excellence, that establishes and inculcates best practices for
building analytical applications
• The right data and architecture
• The right tools: Redshift, Matillion and Tableau are best for Self-Serve
29. Report Automation
Before (TL;DR: Ctrl+C, Ctrl+V, IT dependency):
• Lots of SQL and Excel routine
• Each team defines its own report style and format
• Multiple definitions of the same metrics
• No visualization, no alerts
• Slow data discovery and hypothesis evaluation
After:
• Central BI portal
• Reusable Tableau data sources, a.k.a. the business layer
• Common WBR format
• Eliminated manual work
• No spreadsheets or ad-hoc SQL queries
• Data discovery
• ETL integration
• Friendly drag-and-drop GUI
31. Five Points of Guidance for Redshift (SET DW)
1. Sort Keys:
• Choose up to 3 columns
• Ordered in increasing order of specificity, balanced with likelihood of use.
• Leave INTERLEAVED sort keys for 1 year anniversary.
2. Column Encoding:
• Compress all columns except for (at least) the first sort key.
3. Table Maintenance:
• VACUUM and ANALYZE tables weekly (use STL_ALERT_EVENT_LOG as a guide for frequency).
• ANALYZE PREDICATE COLUMNS is very useful for quick daily stats refresh.
4. Choose a Distribution Key that:
• Follows the common join pattern for the table.
• Evenly distributes the data across the database slices on the cluster.
• DISTSTYLE ALL is a great go-to for dimension tables < ~3 million rows.
• DISTSTYLE EVEN is a good fail-safe, but guarantees inter-node data redistribution.
5. Workload Management (WLM) and Query Monitoring Rules (QMR):
• Start with up to 3 queues (in addition to what Redshift provides automatically).
• Put ETL in its own queue with very low active_statement count (perhaps as low as 1 or 2). Monitor commit queuing.
• Split up the memory across the queues. Monitor the percent of each queue’s workload going to disk.
• Expect to change WLM settings to match the workload changes (day|night, weekday|weekend)
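Points 1, 2, and 4 above can be illustrated with a toy DDL: up to three sort keys in increasing specificity, explicit encodings with the first sort key left uncompressed (RAW), and a distribution key following the join pattern. The table and column names are made up for the example.

```python
def fact_table_ddl():
    """Return example Redshift DDL following the sort key, encoding,
    and distribution key guidance."""
    return """
CREATE TABLE fact_sales (
    sale_date   DATE          ENCODE RAW,   -- 1st sort key: left uncompressed
    customer_id BIGINT        ENCODE ZSTD,  -- dist key: common join column
    listing_id  BIGINT        ENCODE ZSTD,
    amount      NUMERIC(12,2) ENCODE ZSTD
)
DISTKEY (customer_id)
SORTKEY (sale_date, customer_id, listing_id);  -- 3 cols, increasing specificity
""".strip()

print(fact_table_ddl())
```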
32. Lesson One. CHOOSE RIGHT MIGRATION STRATEGY
Lift & Shift
• Typical Approach
• Move all-at-once
• Move to the target platform, then evolve
• This approach gets you to the cloud quickly
• Relatively small barrier to learning the new technology, since it tends to be a close fit
33. Lesson One. CHOOSE RIGHT MIGRATION STRATEGY
Split & Flip
• Split the application into logical functional data layers
• Match the data functionality with the right technology
• Leverage the wide selection of tools on AWS to best fit the need
• Move data in phases: prototype, learn and perfect
34. Lesson Two. CHANGE YOUR MINDSET
Take the time to learn
• It is critical to train on and learn the new technologies being used
• It is easy to think only about translating or converting
• We made many such changes: relational vs non-relational, batch vs streaming, service-based vs procedural, etc.
35. Lesson Two. CHANGE YOUR MINDSET
Traditional DW — faster runtime is better
Cloud — if runtime is slower, it is easy to scale
Reality:
• Query #1 uses 64 cores and runs in 1 min; Query #2 uses 1 core and runs in 2 mins
• The practical limitation to scaling is a fixed budget
36. Lesson Two. CHANGE YOUR MINDSET
We optimized for cost in Redshift
• What is the most work that can be done with a given fixed budget?
• The focus is on the total amount of work, versus optimizing for a single user
• Everything you use comes at a cost in the cloud
• Examples: DynamoDB performance; Redshift vs Spectrum (S3)
Cost is just one example of the many mindset changes we made.
37. Lesson Three. DO NOT BE SCARED TO OPEN THE BLACK BOX
• All business logic is hidden in legacy ETL scripts
• There is a tradeoff between a fast project and business users' expectations
• Learn about your business
• Discover and fix the issues
38. Lesson Four. BE AGILE AND INVOLVE BUSINESS
Agile Benefits
• See results earlier
• Constant feedback
• Serves your users
• Flexibility
• Quality Assurance
39. Lesson Five. PLAN YOUR EVOLUTION
Handling Less Efficient Queries
• Provide a separate cluster as a sandbox
• App developers design new queries that fit the constraints of hands-off operations
Example: create roll-up summary tables in Redshift.
Notes on the ETL criteria:
• Company "winner": will this tool be supported and fully usable in 3-5 years? Will it be adopted by Amazon, and will there be a community of use? Are there recommendations within Amazon (such as from AWS SAs)? Years in business, customers, profitability.
• Management: scheduling built in; intuitive views of DW processes, models, and schedules; does it help someone understand DW data flows?
• Deployment/architecture: AWS better than local; Linux better than Windows; must be a patchable platform within Amazon guidelines.
• The biggest risk was the investment in a tool from a small player; porting ETL processes from Matillion would be no less expensive than porting from PL/SQL and db links.