Building Cloud Self-Service
Analytical Solutions
By Dmitry Anoshin, Data Engineer, Abebooks (Amazon Subsidiary)
Outline
• About Myself
• About Abebooks
• Choosing ETL for the Cloud
• Data Acquisition Patterns with Matillion ETL
• Setting Up Self-Service BI
• Lessons Learned during the journey to the Cloud
About Myself
• Working with BI since 2007
• Implemented BI in Russia/Europe/Canada
Technical Skills Matrix (timeline: 2007, 2010, 2015, 2018)
• Databases: Oracle, Teradata, Vertica, Snowflake, Redshift, MySQL, PostgreSQL, MS SQL Server
• ETL: Pentaho DI, Informatica, Matillion ETL
• BI: SAP BusinessObjects, Tableau, MicroStrategy, Pentaho BI, SAS BI
• Big Data: Cloudera Hadoop, Hive, Hue, Splunk, Hunk, Elasticsearch
• Digital Marketing: GA, Piwik, Tealium, Adjust, Adobe
• Data Analytics: R, Python
My Books
#dimaworkplace
About Abebooks
• Online marketplace for books, art & collectibles.
• Amazon subsidiary since 2008; a marketplace for used books and, increasingly, non-book collectibles
• 350 Mln listings
• 3 in ‘DB Team’
• 2 locations: Victoria, BC and Dusseldorf
Abebooks Data Flows
• Built by DBAs - db links, PL/SQL, external tables, shell scripts
• Even before 2015, Redshift was a strategic choice, but an ETL rewrite was too expensive
[Diagram: Source Layer (SALES, INVENTORY, CS, SFTP); ETL (PL/SQL); Storage Layer (DW); Access Layer (Ad-hoc SQL)]
Choosing ETL Tool for Cloud
Use Cases
• OLTP to S3
• S3 to Redshift
• SFTP/API to Redshift
• Data Transformation
• Dimensional Modelling
Tools
• Pentaho DI
• Informatica
• AWS Data Pipeline
• Talend
• Matillion
ETL Criteria
High:
• Support native Redshift driver
• Easily capture from relational db, CDC
• Ease of Use for BI/DW
• Cover use cases
• On-Premise
Medium:
• Support NoSQL
• Company “Winner”
• Deployment/Architecture
• Encryption
• Ease of Use for non BI/DW
• Data Transformations
• Management
• Pricing
• Performance
Low:
• Version Control
• Linux OS
• ETL Monitoring
• Logging
• R/Python
Why We Picked Matillion
• specific redshift support, built around Redshift platform
• speed of ETL operations
• speed of development
• wide range of data sources supported
• ease of use outside of DE/DBA expertise
• Native with AWS
• $$$
• The biggest risk: putting all our eggs in the Matillion basket, betting on a small, new player.
Data acquisition
patterns with
Matillion ELT
Abebooks Cloud Analytics Architecture
[Architecture diagram. Components shown:]
• Source Systems: External API, SFTP, Apps
• AWS services (Abebooks DW Account): Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Redshift Spectrum, DynamoDB, Amazon RDS, Amazon Elastic Load Balancing, S3 Data Lake
• Event/Notification Services: SQS, SNS, Amazon Chime
• Matillion ELT on EC2 (m4.large, 2 vCPU, 8 GB RAM)
• End Users Access: Tableau Server, Tableau Web, Tableau Desktop, ad-hoc SQL
Pattern 1: getting data via SFTP
• Scan SFTP, get all file names, load them into Redshift
• Identify only new files
• Load one ${file_name} at a time (an IF step lets us choose the right stream)
• Insert processed ${file_name} into Redshift
• Load next file
Takeaways:
• Python BOTO library for managing S3
• Matillion variables ${variable}
• Using Matillion Iterators
• Execute SQL via Python
• If file is missing, try again later
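The loop above can be sketched in plain Python. This is only an illustration of the pattern, not the Matillion job itself: `list_remote_files`, `load_file` and the in-memory `processed` set are hypothetical stand-ins for the SFTP scan, the Redshift load and the table of processed file names.

```python
from typing import Callable, Iterable, List, Set


def find_new_files(remote_files: Iterable[str], processed: Set[str]) -> List[str]:
    """Return file names seen on the server that have not been processed yet."""
    return sorted(name for name in set(remote_files) if name not in processed)


def run_sftp_load(
    list_remote_files: Callable[[], Iterable[str]],
    load_file: Callable[[str], None],
    processed: Set[str],
) -> List[str]:
    """Load each new file once, record it, and return the names loaded this run."""
    loaded = []
    for file_name in find_new_files(list_remote_files(), processed):
        try:
            load_file(file_name)      # one file per iteration, like ${file_name}
        except FileNotFoundError:
            continue                  # file is missing: leave it for a later run
        processed.add(file_name)      # record the name only after a successful load
        loaded.append(file_name)
    return loaded
```

Recording the file name only after a successful load means a failed or missing file is picked up again on the next scheduled run.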
Pattern 2: getting data via API
• Connect API via Python script
• Get data via API calls and save it as CSV on the EC2 instance
• Upload CSV into S3
• Load CSV into Redshift
Takeaways:
• Using Python to connect external API
• Using AWS KMS to encrypt credentials
• Using SNS for email notification
• Using Matillion system variables for ETL logs
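A minimal sketch of the CSV and S3 steps, assuming the API returns a list of dictionaries. The column names, bucket and key are hypothetical, and `s3_client` is expected to be a boto3 S3 client created by the caller; the final load into Redshift is left to the ETL tool.

```python
import csv
import io
from typing import Any, Dict, Iterable, List


def records_to_csv(records: Iterable[Dict[str, Any]], columns: List[str]) -> str:
    """Serialize API records to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        # Missing fields become empty strings so every row has the same shape.
        writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()


def upload_csv(s3_client: Any, bucket: str, key: str, csv_text: str) -> None:
    """Upload the CSV text to S3 (s3_client: a boto3 S3 client)."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=csv_text.encode("utf-8"))
```

Fixing the column order up front keeps the file layout stable for the Redshift load even when the API adds or drops fields.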
Pattern 3: getting data from DynamoDB
Takeaways:
• Using the DynamoDB component (it generates the COPY command for you)
• You can’t easily get incremental changes, so every load is a full reload
• Speed depends on two things: the "read ratio" and the per-table "read capacity". The actual rows-per-hour value is going to be based on readRatio * tableReadCapacity.
• 51m rows with 35% read ratio and 300 read capacity = 9 hours
• 211m rows with 66% read ratio and 1500 read capacity = 4 hours
• Reloading once a week
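The two data points above can be checked with a few lines of arithmetic. This is a sketch only: the table names and role ARN passed to `build_copy_statement` are hypothetical, and the statement shown is the general Redshift `COPY ... FROM 'dynamodb://...' READRATIO` form, not necessarily what the Matillion component emits.

```python
def effective_read_capacity(read_ratio_pct: int, table_read_capacity: int) -> float:
    """Read capacity the load may consume: readRatio * tableReadCapacity."""
    return read_ratio_pct * table_read_capacity / 100


def observed_rows_per_hour(rows: int, hours: float) -> float:
    """Throughput actually observed for a full reload."""
    return rows / hours


def build_copy_statement(target: str, dynamodb_table: str, role_arn: str, read_ratio_pct: int) -> str:
    """Build a Redshift COPY statement that reads from a DynamoDB table."""
    return (
        f"COPY {target} FROM 'dynamodb://{dynamodb_table}' "
        f"IAM_ROLE '{role_arn}' READRATIO {read_ratio_pct};"
    )


small = effective_read_capacity(35, 300)     # 105.0
large = effective_read_capacity(66, 1500)    # 990.0
print(large / small)                         # about 9.4x more read capacity
print(observed_rows_per_hour(211_000_000, 4) / observed_rows_per_hour(51_000_000, 9))  # about 9.3x more rows per hour
```

With the figures from the slide, throughput grows almost in proportion to readRatio * tableReadCapacity, which is consistent with the rule of thumb stated above.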
Pattern 4: getting data from external S3*
To get data from a bucket in another VPC, change the bucket policy; the bucket then shows up in the list of buckets in Matillion.
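The slides do not show the policy itself. As a rough sketch, a read-only bucket policy of this kind could be generated as follows; the bucket name and the principal ARN are hypothetical, and the exact actions granted are an assumption.

```python
import json
from typing import Any, Dict


def read_only_bucket_policy(bucket: str, principal_arn: str) -> Dict[str, Any]:
    """Build an S3 bucket policy that lets one principal list and read a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",        # the bucket itself
            },
            {
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",      # objects in the bucket
            },
        ],
    }


policy = read_only_bucket_policy(
    "example-external-bucket",                       # hypothetical bucket
    "arn:aws:iam::123456789012:role/example-etl",    # hypothetical role used by the ETL instance
)
print(json.dumps(policy, indent=2))
```

The bucket owner would attach the resulting JSON as the bucket policy.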
Pattern 5: Matillion connectors for Apps
Pattern 6: Using SQS for Triggering Job
Using SQS service we can trigger almost anything in Matillion or AWS
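A sketch of sending such a trigger from Python. The message fields (`group`, `project`, `version`, `environment`, `job`, `variables`) follow the layout Matillion's SQS listener is understood to expect; treat them as an assumption and check the Matillion documentation. All values are hypothetical, and `sqs_client` is expected to be a boto3 SQS client.

```python
import json
from typing import Any, Dict, Optional


def build_job_message(
    group: str,
    project: str,
    version: str,
    environment: str,
    job: str,
    variables: Optional[Dict[str, str]] = None,
) -> str:
    """Build the JSON body asking the ETL tool to run one job (assumed layout)."""
    return json.dumps(
        {
            "group": group,
            "project": project,
            "version": version,
            "environment": environment,
            "job": job,
            "variables": variables or {},
        }
    )


def send_job_message(sqs_client: Any, queue_url: str, message_body: str) -> Dict[str, Any]:
    """Put the message on the queue (sqs_client: a boto3 SQS client)."""
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=message_body)
```

Any AWS service that can write to a queue, for example an S3 event notification or a Lambda function, can start a job this way.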
Improving End-User Experience
BI Survey
• ETL was a black box
• A lack of notifications
• A lack of documentation and training
• A lack of automation
• No dependency between reports and ETL process
• High dependency on the BI/DW team
BI Champions
The BI champion is the sheriff, ensuring that the townspeople (the business users) are productive and can do analytics quickly and smoothly.
The BI Champion is meant to be both an
evangelist and subject matter expert for BI
within the organization. The champion should
be well versed in the data important to their
team, and knowledgeable in the core BI
technologies and patterns used within
AbeBooks.
ETL Monitor and notifications
• An SNS topic sends the email. In addition, we can add any number of Matillion variables to the message.
• Using an Amazon Chime webhook, we can execute a curl command via a bash script and send a message to the business users.
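The Chime notification is done with curl from a bash script in the talk; the same POST can be sketched in Python. The webhook URL is a placeholder and the message text is hypothetical; the JSON body uses the `Content` field that Chime incoming webhooks accept.

```python
import json
import urllib.request


def build_chime_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build the POST request for an Amazon Chime incoming webhook."""
    payload = json.dumps({"Content": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url: str, message: str) -> int:
    """Send the message and return the HTTP status code."""
    with urllib.request.urlopen(build_chime_request(webhook_url, message)) as response:
        return response.status
```

A job would call `notify(url, "Sales facts refreshed")` as its last step, filling the message from job variables.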
ETL Monitor
Using Matillion system variables, we track all events and visualize them in Tableau for end users, who can also create alerts in case of failure.
ETL Trigger for Tableau
Task: Refresh Tableau Data Source (Semantic Layer) & Workbooks when FACT tables are refreshed.
Solution: Deploy the Tableau CLI tool on the Matillion EC2 instance and run it via a Bash script
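A sketch of the refresh step, assuming the CLI tool is `tabcmd` and that a `tabcmd login` has already been run on the instance. The data source, workbook and project names are hypothetical. The talk runs this from bash; Python's `subprocess` is used here only to keep the example self-contained.

```python
import subprocess
from typing import Callable, List


def refresh_datasource_cmd(name: str, project: str) -> List[str]:
    """tabcmd call that refreshes the extract of a published data source."""
    return ["tabcmd", "refreshextracts", "--datasource", name,
            "--project", project, "--synchronous"]


def refresh_workbook_cmd(name: str, project: str) -> List[str]:
    """tabcmd call that refreshes the extracts of a workbook."""
    return ["tabcmd", "refreshextracts", "--workbook", name,
            "--project", project, "--synchronous"]


def refresh_after_fact_load(
    datasource: str,
    workbooks: List[str],
    project: str,
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Refresh the shared data source first, then the workbooks built on it."""
    run(refresh_datasource_cmd(datasource, project), check=True)
    for workbook in workbooks:
        run(refresh_workbook_cmd(workbook, project), check=True)
```

Running the commands with `--synchronous` and `check=True` makes the ETL job fail visibly if a refresh fails, instead of reporting success while dashboards stay stale.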
Self-Service BI
• Change Management: from report-writing culture to data-driven company
• Clear authority: executive support
• The analytic culture: Business executives must have a vision for analytics and the willingness to invest in the
people, processes, and technologies for the long haul to ensure a successful outcome.
• The right people (data engineers, BI engineers, business analysts)
• The right organizational structure: BI Center of Excellence, that establishes and inculcates best practices for
building analytical applications
• The right data and architecture
• The right tools: Redshift, Matillion and Tableau are best for Self-Serve
Report Automation
• Central BI Portal
• Reusable Tableau Data Sources a.k.a. Business Layer
• Common WBR Format
• Eliminate manual work
• No spreadsheets and ad-hoc SQL queries
• Data Discovery
• ETL Integration
• Friendly drag and drop GUI
TL;DR: CTRL+C, CTRL+V, IT dependency
• Lots of SQL and Excel routine
• Each team defines its own report style and format
• Multiple definitions of the same metrics
• No visualization, no alerts
• Slow data discovery, hypothesis evaluation
Lessons Learned
from moving DW into
AWS (Cloud)
Five Points of Guidance for Redshift (SET DW)
1. Sort Keys:
• Choose up to 3 columns
• Ordered in increasing order of specificity, balanced with likelihood of use.
• Leave INTERLEAVED sort keys until the 1-year anniversary.
2. Column Encoding:
• Compress all columns except for (at least) the first sort key.
3. Table Maintenance:
• VACUUM and ANALYZE tables weekly (use STL_ALERT_EVENT_LOG as a guide for frequency).
• ANALYZE PREDICATE COLUMNS is very useful for quick daily stats refresh.
4. Choose a Distribution Key that:
• Follows the common join pattern for the table.
• Evenly distributes the data across the database slices on the cluster.
• DISTSTYLE ALL is a great go-to for dimension tables < ~3 million rows.
• DISTSTYLE EVEN is a good fail-safe, but guarantees inter-node data redistribution.
5. Workload Management (WLM) and Query Monitoring Rules (QMR):
• Start with up to 3 queues (in addition to what Redshift provides automatically).
• Put ETL in its own queue with very low active_statement count (perhaps as low as 1 or 2). Monitor commit queuing.
• Split up the memory across the queues. Monitor the percent of each queue’s workload going to disk.
• Expect to change WLM settings to match the workload changes (day|night, weekday|weekend)
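Points 1 to 4 can be turned into a small DDL generator. This is a sketch under stated assumptions: the table and column names are hypothetical, ZSTD is picked as the compression encoding only for illustration (the slide does not name one), and the first sort key column is left uncompressed with ENCODE RAW.

```python
from typing import List, Tuple


def build_fact_table_ddl(
    table: str,
    columns: List[Tuple[str, str]],
    dist_key: str,
    sort_keys: List[str],
) -> str:
    """Build a Redshift CREATE TABLE with a DISTKEY and a compound SORTKEY."""
    if not 1 <= len(sort_keys) <= 3:
        raise ValueError("choose between 1 and 3 sort key columns")
    column_lines = []
    for name, sql_type in columns:
        # Compress every column except the first sort key.
        encoding = "RAW" if name == sort_keys[0] else "ZSTD"
        column_lines.append(f"    {name} {sql_type} ENCODE {encoding}")
    parts = [
        f"CREATE TABLE {table} (",
        ",\n".join(column_lines),
        ")",
        f"DISTSTYLE KEY DISTKEY ({dist_key})",
        f"COMPOUND SORTKEY ({', '.join(sort_keys)});",
    ]
    return "\n".join(parts)


def weekly_maintenance(table: str) -> List[str]:
    """Statements for the weekly VACUUM and ANALYZE."""
    return [f"VACUUM {table};", f"ANALYZE {table};"]


def daily_stats_refresh(table: str) -> str:
    """Quick daily statistics refresh limited to predicate columns."""
    return f"ANALYZE {table} PREDICATE COLUMNS;"
```

For a small dimension table the `DISTSTYLE KEY DISTKEY (...)` line would be replaced by `DISTSTYLE ALL`, as suggested in point 4.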
Lesson One. CHOOSE RIGHT MIGRATION STRATEGY
Lift & Shift
• Typical Approach
• Move all-at-once
• Target platform then evolve
• Approach gets you to the cloud quickly
• Relatively small barrier to learning new
technology since it tends to be a close fit
Lesson One. CHOOSE RIGHT MIGRATION STRATEGY
Split & Flip
• Split application into logical functional
data layers
• Match the data functionality with the
right technology
• Leverage the wide selection of tools
on AWS to best fit the need
• Move data in phases — prototype,
learn and perfect
Lesson Two. CHANGE YOUR MINDSET
Take the time to learn
• Critical to train and learn the new technologies that
are being used
• Easy to think about translating or converting
• Made many such changes — relational vs non-
relational, batch vs streaming, service based vs
procedural, etc.
Lesson Two. CHANGE YOUR MINDSET
Traditional DW — faster runtime is better
Cloud — if runtime is slower, it is easy to scale
Reality
• Query #1 uses 64 cores and runs in 1 min
• Query #2 uses 1 core and runs in 2 mins
• Practical limitation to scale: fixed budget
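The comparison is easier to see in core-minutes. A small worked example, assuming a fixed budget of 64 cores, a 2-minute window and perfect packing of queries onto cores (these three figures are assumptions made for illustration):

```python
def core_minutes(cores: int, runtime_minutes: int) -> int:
    """Total compute consumed by one run of a query."""
    return cores * runtime_minutes


def queries_per_window(budget_cores: int, window_minutes: int,
                       cores: int, runtime_minutes: int) -> int:
    """How many runs of a query fit into the budget during the window."""
    concurrent = budget_cores // cores              # runs side by side
    sequential = window_minutes // runtime_minutes  # runs back to back
    return concurrent * sequential


print(core_minutes(64, 1))               # query #1: 64 core-minutes
print(core_minutes(1, 2))                # query #2: 2 core-minutes
print(queries_per_window(64, 2, 64, 1))  # 2 runs of query #1
print(queries_per_window(64, 2, 1, 2))   # 64 runs of query #2
```

Query #1 finishes sooner, but under the same budget the cluster completes far more work with queries shaped like query #2.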
Lesson Two. CHANGE YOUR MINDSET
We Optimized for Cost in Redshift
• What is the greatest amount of work that can be done within the given fixed budget?
• Focus is on the total amount of work versus optimizing for a single user
• Everything you use comes at a cost on the Cloud
  • DynamoDB performance
  • Redshift vs Spectrum (S3)
Cost is just one example of the many mindset changes that we made
Lesson Three. DO NOT BE SCARED TO OPEN THE BLACK BOX
• All business logic is hidden in legacy ETL scripts
• Trade-off between fast project delivery and business users' expectations
• Learn about your business
• Discover and fix the issues
Lesson Four. BE AGILE AND INVOLVE BUSINESS
Agile Benefits
• See results earlier
• Feedback Constantly
• Serves your users
• Flexibility
• Quality Assurance
Lesson Five. PLAN YOUR EVOLUTION
Handling Less Efficient Queries
• Provide a separate cluster as a sandbox
• App developers design new queries that fit the constraints of hands-off operations
Example: create roll-up summary tables in Redshift.
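A roll-up summary table can be created with a CREATE TABLE AS statement. The sketch below only builds the SQL text; the table and column names are hypothetical.

```python
from typing import List


def build_rollup_ctas(
    summary_table: str,
    source_table: str,
    group_by: List[str],
    measures: List[str],
) -> str:
    """Build a CREATE TABLE AS statement that pre-aggregates a fact table."""
    dims = ", ".join(group_by)
    sums = ", ".join(f"SUM({m}) AS {m}" for m in measures)
    return (
        f"CREATE TABLE {summary_table} AS\n"
        f"SELECT {dims}, {sums}\n"
        f"FROM {source_table}\n"
        f"GROUP BY {dims};"
    )


print(build_rollup_ctas(
    "sales_daily_summary",          # hypothetical summary table
    "sales_fact",                   # hypothetical detailed fact table
    ["order_date", "country"],
    ["amount", "quantity"],
))
```

Dashboards and less efficient queries then read the small summary table instead of scanning the detailed fact table.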
Q&A
Contact details: anoshind@amazon.com

 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 

Último (20)

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 

Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution

  • 1. Building Cloud Self-Service Analytical Solutions By Dmitry Anoshin, Data Engineer, Abebooks (Amazon Subsidiary)
  • 2.
  • 3. Outline • About Myself • About Abebooks • Choosing ETL for the Cloud • Data Acquisition Patterns with Matillion ETL • Set Self-Service BI • Lessons Learned during the journey to the Cloud
  • 4. About Myself • Work with BI since 2007 • Implemented BI in Russia/Europe/Canada
  • 5. Technical Skills Matrix (timeline: 2007, 2010, 2015, 2018) • Databases (Oracle, Teradata, Vertica, Snowflake, Redshift, MySQL, PostgreSQL, MS SQL Server) • ETL (Pentaho DI, Informatica, Matillion ETL) • BI (SAP BusinessObjects, Tableau, MicroStrategy, Pentaho BI, SAS BI) • Big Data (Cloudera Hadoop, Hive, Hue, Splunk, Hunk, Elasticsearch) • Digital Marketing (GA, Piwik, Tealium, Adjust, Adobe) • Data Analytics (R, Python)
  • 8. About Abebooks • Online marketplace for books, art & collectibles • Amazon subsidiary since 2008; a marketplace for used books and, increasingly, non-book collectibles • 350 million listings • 3 people in the ‘DB Team’ • 2 locations: Victoria, BC and Dusseldorf
  • 9.
  • 10. Abebooks Data Flows • Built by DBAs: db links, PL/SQL, external tables, shell scripts • Even before 2015 Redshift was strategic, but an ETL rewrite was too expensive • Diagram labels: Source Layer (SALES, INVENTORY, CS, SFTP); ETL (PL/SQL); Storage Layer (DW); Access Layer (Ad-hoc SQL)
  • 11. Choosing ETL Tool for Cloud Use Cases • OLTP to S3 • S3 to Redshift • SFTP/API to Redshift • Data Transformation • Dimensional Modelling Tools • Pentaho DI • Informatica • AWS Data Pipeline • Talend • Matillion
  • 12. ETL Criteria High: • Support native Redshift driver • Easily capture from relational db, CDC • Ease of Use for BI/DW • Cover use cases • On-Premise Medium: • Support NoSQL • Company “Winner” • Deployment/Architecture • Encryption • Ease of Use for non BI/DW • Data Transformations • Management • Pricing • Performance Low: • Version Control • Linux OS • ETL Monitoring • Logging • R/Python
  • 13. Why We Picked Matillion • Specific Redshift support, built around the Redshift platform • Speed of ETL operations • Speed of development • Wide range of data sources supported • Ease of use outside of DE/DBA expertise • Native to AWS • $$$ • The biggest risk: putting all our eggs in the Matillion basket, betting on a small and new player.
  • 15. Abebooks Cloud Analytics Architecture (diagram labels): Source Systems; Amazon Athena; Amazon EMR; Amazon Redshift; Abebooks DW Account; DynamoDB; Amazon RDS; Amazon Redshift Spectrum; Amazon Elastic Load Balancing; S3 Data Lake; SQS; SNS; Amazon Chime; Event/Notification Services; External API; SFTP; Apps; Matillion ELT (EC2 m4.large, 2 vCPU, 8 GB RAM); Tableau Server; Tableau Web; Tableau Desktop; Ad-hoc SQL; End Users Access
  • 16. Pattern 1: getting data via SFTP • Scan SFTP, get all file names, load them into Redshift • Identify only new files • Load one ${file_name} at a time (an IF condition routes each file to the right stream) • Insert the processed ${file_name} into Redshift • Load the next file Takeaways: • Python Boto library for managing S3 • Matillion variables ${variable} • Using Matillion Iterators • Execute SQL via Python • If a file is missing, try again later
  • 17. Pattern 2: getting data via API • Connect to the API via a Python script • Get data via calls and save it to CSV on EC2 • Upload the CSV into S3 • Load the CSV into Redshift Takeaways: • Using Python to connect to an external API • Using AWS KMS to encrypt credentials • Using SNS for email notification • Using Matillion system variables for ETL logs
  • 18. Pattern 3: getting data from DynamoDB Takeaways: • Using the DynamoDB component (it generates the COPY command for you) • You can’t easily get incremental changes, so each load is a full reload • Speed depends on two things, the "read ratio" and the per-table "read capacity". The actual rows per hour value is going to be based on readRatio * tableReadCapacity. • 51m rows with 35% read ratio and 300 read capacity = 9 hours • 211m rows with 66% read ratio and 1500 read capacity = 4 hours • Reloading once a week
  • 19. Pattern 4: getting data from external S3 • Getting data from another VPC: change the policy of the bucket, and you can see it in the list of buckets in Matillion
  • 20. Pattern 5: Matillion connectors for Apps
  • 21. Pattern 6: Using SQS for Triggering Jobs • Using the SQS service, we can trigger almost anything in Matillion or AWS
  • 23. BI Survey • ETL was a black box • A lack of notifications • A lack of documentation and training • A lack of automation • No dependency between reports and ETL process • High dependency on the BI/DW team
  • 24. BI Champions The BI champion is the sheriff, ensuring that the townspeople (the business users) are productive and can do analytics quickly and smoothly. The BI Champion is meant to be both an evangelist and subject matter expert for BI within the organization. The champion should be well versed in the data important to their team, and knowledgeable in the core BI technologies and patterns used within AbeBooks.
  • 25. ETL Monitor and notifications • An SNS topic sends the email; the message can include any number of Matillion variables • Using an Amazon Chime webhook, a Bash script can run a curl command that sends a message to the business users
  • 26. ETL Monitor Using Matillion system variables, we track all events and visualize them in Tableau for end users, who can also create alerts in case of failure.
  • 27. ETL Trigger for Tableau Task: Refresh Tableau Data Source (Semantic Layer) & Workbooks when FACT tables are refreshed. Solution: Deploy the Tableau CLI tool on the Matillion EC2 instance and run it via a Bash script
  • 28. Self-Service BI • Change management: from a report-writing culture to a data-driven company • Clear authority: executive support • The analytic culture: Business executives must have a vision for analytics and the willingness to invest in the people, processes, and technologies for the long haul to ensure a successful outcome. • The right people (data engineers, BI engineers, business analysts) • The right organizational structure: a BI Center of Excellence that establishes and inculcates best practices for building analytical applications • The right data and architecture • The right tools: Redshift, Matillion and Tableau are best for Self-Serve
  • 29. Report Automation • Central BI Portal • Reusable Tableau Data Sources a.k.a. Business Layer • Common WBR Format • Eliminate manual work • No spreadsheets and ad-hoc SQL queries • Data Discovery • ETL Integration • Friendly drag and drop GUI TL;DR: CTRL+C, CTRL+V, IT dependency • Lots of routine SQL and Excel work • Each team defines its own report style and format • Multiple definitions of the same metric • No visualization, no alerts • Slow data discovery, hypothesis evaluation
  • 30. Lessons Learned from moving DW into AWS (Cloud)
  • 31. Five Points of Guidance for Redshift (SET DW) 1. Sort Keys: • Choose up to 3 columns • Order them by increasing specificity, balanced with likelihood of use. • Leave INTERLEAVED sort keys for the one-year anniversary. 2. Column Encoding: • Compress all columns except for (at least) the first sort key. 3. Table Maintenance: • VACUUM and ANALYZE tables weekly (use STL_ALERT_EVENT_LOG as a guide for frequency). • ANALYZE PREDICATE COLUMNS is very useful for quick daily stats refresh. 4. Choose a Distribution Key that: • Follows the common join pattern for the table. • Evenly distributes the data across the database slices on the cluster. • DISTSTYLE ALL is a great go-to for dimension tables < ~3 million rows. • DISTSTYLE EVEN is a good fail-safe, but guarantees inter-node data redistribution. 5. Workload Management (WLM) and Query Monitoring Rules (QMR): • Start with up to 3 queues (in addition to what Redshift provides automatically). • Put ETL in its own queue with very low active_statement count (perhaps as low as 1 or 2). Monitor commit queuing. • Split up the memory across the queues. Monitor the percent of each queue’s workload going to disk. • Expect to change WLM settings to match the workload changes (day|night, weekday|weekend)
  • 32. Lesson One. CHOOSE RIGHT MIGRATION STRATEGY Lift & Shift • Typical Approach • Move all-at-once • Target platform then evolve • Approach gets you to the cloud quickly • Relatively small barrier to learning new technology since it tends to be a close fit
  • 33. Lesson One. CHOOSE RIGHT MIGRATION STRATEGY Split & Flip • Split application into logical functional data layers • Match the data functionality with the right technology • Leverage the wide selection of tools on AWS to best fit the need • Move data in phases — prototype, learn and perfect
  • 34. Lesson Two. CHANGE YOUR MINDSET Take the time to learn • Critical to train and learn the new technologies that are being used • Easy to think about translating or converting • Made many such changes: relational vs non-relational, batch vs streaming, service-based vs procedural, etc.
  • 35. Lesson Two. CHANGE YOUR MINDSET • Traditional DW: faster runtime is better • Cloud: if runtime is slower, it is easy to scale • Reality: Query #1 uses 64 cores and Query #2 uses 1 core • Practical limitation to scale: fixed budget • Figure labels: #1 runs in 1 min; #2 runs in 2 mins
  • 36. Lesson Two. CHANGE YOUR MINDSET We Optimized For Cost in Redshift • What is the greatest amount of work that can be done within the given fixed budget? • Focus is on the total amount of work versus optimizing for a single user • Everything you use comes at a cost on the Cloud • DynamoDB performance • Redshift vs Spectrum (S3) Cost is just one example of the many mindset changes that we made
  • 37. Lesson Three. DO NOT BE SCARED TO OPEN THE BLACK BOX • All business logic is hidden in legacy ETL scripts • Trade-off between a fast project and business users' expectations • Learn about your business • Discover and fix the issues
  • 38. Lesson Four. BE AGILE AND INVOLVE BUSINESS Agile Benefits • See results earlier • Feedback Constantly • Serves your users • Flexibility • Quality Assurance
  • 39. Lesson Five. PLAN YOUR EVOLUTION Handling Less Efficient Queries • Provide a separate cluster as a sandbox • App developers design new queries that will fit the constraints of hands-off operations • Example: create roll-up summary tables in Redshift
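
Slide 16 (Pattern 1) boils down to a bookkeeping loop: list the files on the SFTP server, subtract the names already recorded in Redshift, and process what is left one ${file_name} at a time. A minimal Python sketch of that selection step follows. It is not the Matillion job itself: `select_new_files` and `process_all` are hypothetical names, the `processed` set stands in for the Redshift tracking table, and `load_one` stands in for whatever loads a single file.

```python
from typing import Callable, Iterable, List, Set


def select_new_files(listed: Iterable[str], processed: Set[str]) -> List[str]:
    """Return file names seen on SFTP that have not been loaded yet, sorted."""
    return sorted(set(listed) - processed)


def process_all(listed: Iterable[str], processed: Set[str],
                load_one: Callable[[str], bool]) -> List[str]:
    """Load new files one at a time; record a name only after a successful load.

    load_one is a hypothetical callback that returns False when the file is
    missing or fails, so the name stays unrecorded and is retried next run.
    """
    loaded = []
    for file_name in select_new_files(listed, processed):
        if load_one(file_name):
            processed.add(file_name)  # stands in for the INSERT into the tracking table
            loaded.append(file_name)
    return loaded


if __name__ == "__main__":
    already = {"sales_2018-01-01.csv"}
    on_sftp = ["sales_2018-01-01.csv", "sales_2018-01-02.csv", "cs_2018-01-02.csv"]
    # Pretend the CS file is not readable yet, so it is left for a later run.
    print(process_all(on_sftp, already, lambda name: not name.startswith("cs_")))
```

Recording the name only after a successful load is what makes "if a file is missing, try again later" work without extra state.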
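
Slide 18 (Pattern 3) says DynamoDB load speed scales with readRatio * tableReadCapacity and reports two measured loads. Dividing each load's rows per second by that product gives roughly 15 rows per consumed read unit per second in both cases. The sketch below turns this into a rough estimator. The constant 15 is inferred here from the slide's two data points; it is not a DynamoDB or Matillion figure and will differ for tables with other item sizes. `rows_per_unit_second` and `estimate_hours` are hypothetical helpers.

```python
def rows_per_unit_second(rows: float, hours: float,
                         read_ratio: float, read_capacity: float) -> float:
    """Rows loaded per second for each read unit actually consumed."""
    consumed_units = read_ratio * read_capacity
    return rows / (hours * 3600) / consumed_units


def estimate_hours(rows: float, read_ratio: float, read_capacity: float,
                   rows_per_unit: float = 15.0) -> float:
    """Rough full-reload duration; rows_per_unit is the inferred assumption."""
    return rows / (read_ratio * read_capacity * rows_per_unit * 3600)


if __name__ == "__main__":
    # The two loads reported on the slide.
    print(round(rows_per_unit_second(51e6, 9, 0.35, 300), 1))
    print(round(rows_per_unit_second(211e6, 4, 0.66, 1500), 1))
    # What-if: the smaller table loaded with the larger table's settings.
    print(round(estimate_hours(51e6, 0.66, 1500), 2))
```

An estimate like this helps decide whether raising read capacity for the weekly reload window is worth the cost.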
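
Slide 25 sends ETL notifications to business users by calling an Amazon Chime webhook with curl from a Bash script. The same call can be sketched in Python with only the standard library. Chime incoming webhooks take a JSON body with a `Content` field; the webhook URL, job name, and message wording below are placeholders, and `build_chime_request` and `notify` are hypothetical helpers rather than anything shown in the talk.

```python
import json
import urllib.request


def build_chime_request(webhook_url: str, job_name: str,
                        status: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a Chime incoming webhook."""
    payload = {"Content": f"ETL job {job_name} finished with status {status}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url: str, job_name: str, status: str) -> int:
    """Send the message and return the HTTP status code."""
    request = build_chime_request(webhook_url, job_name, status)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder URL: a real one comes from the chat room's webhook settings.
    req = build_chime_request("https://example.invalid/webhook",
                              "load_sales_fact", "SUCCESS")
    print(req.get_method(), req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes the message format easy to check without touching the network.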
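
Slide 35's point about a fixed budget can be made concrete with core-minutes. Reading the figure labels as Query #1 taking 1 minute on 64 cores and Query #2 taking 2 minutes on 1 core (an interpretation of the flattened slide, not a stated result), the faster query consumes 32 times more capacity. `core_minutes` and `queries_per_budget` are hypothetical helpers for this arithmetic.

```python
def core_minutes(cores: int, minutes: float) -> float:
    """Consumed capacity: cores held multiplied by wall-clock minutes."""
    return cores * minutes


def queries_per_budget(budget_core_minutes: float, cores: int,
                       minutes: float) -> int:
    """How many runs of a query fit into a fixed budget of core-minutes."""
    return int(budget_core_minutes // core_minutes(cores, minutes))


if __name__ == "__main__":
    fast = core_minutes(64, 1)  # Query #1: quick, but occupies the whole cluster
    slow = core_minutes(1, 2)   # Query #2: slower, but occupies a single core
    print(fast, slow, fast / slow)
    # Budget of one hour on 64 cores, i.e. 3840 core-minutes.
    print(queries_per_budget(64 * 60, 64, 1), queries_per_budget(64 * 60, 1, 2))
```

This is the sense in which slide 36 optimizes for total work done within the budget rather than for a single user's runtime.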

Editor's Notes

  1. Company a 'winner': will this tool be supported and fully usable in 3-5 years; will it be adopted by Amazon; will there be a community of use; recommendations within Amazon (such as AWS SA); years in business, customers, profitability. Management: scheduling built in; intuitive views of DW processes, models, schedules; does it help someone understand DW data flows. Deployment/architecture: AWS better than local; Linux better than Windows; must be a patchable platform within Amazon guidelines.
  2. Biggest risk was the investment in a tool from a small player. Porting ETL processes from Matillion would be no less expensive than from PL/SQL and dblinks.