© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
June 7, 2018 | 10:00 AM PT
TiVo: How to scale new products with a data lake on AWS and Qubole
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Today’s presenters
Paul Sears, Partner Solutions Architect, Amazon Web Services
Harsh Jetly, Solutions Architect, Qubole
Ashish Mrig, Senior Manager, Big Data Analytics, TiVo
Today’s agenda
1. An overview of Amazon Web Services (AWS) with an
emphasis on AWS data lake solutions and Qubole
2. Overview of the Qubole solutions featured in our story
3. Challenges faced by TiVo
4. The TiVo success story with AWS and Qubole
5. Q&A/Discussion
Learning objectives:
1. How to dramatically reduce management complexities for big data
analytics operations on AWS
2. Best practices for optimizing data lakes for self-service analytics that
enable teams to productionize data science and accelerate data
pipelines
3. Using Presto with Qubole’s auto-scaling management and Spot
Instance Bidding to reduce the complexity, cost, and deployment time
of big data projects
The data lake and AWS
Drive business value with any type of data
Legacy data warehouses and RDBMS
• Complex to set up and manage
• Do not scale
• Take months to add new data sources
• Queries take too long
• Cost $MM upfront
Should I build a data lake?
Starting by amassing "all your data" and dumping
into a large repository for the data gurus to start
finding "insights" is like trying to win the lottery by
buying all the tickets.
Rethink how to become a data-driven business
1. Business outcomes - start with the insights and
actions you want to drive, then work backwards to a
streamlined design
2. Experimentation - start small, test many ideas, keep
the good ones and scale those up, paying only for what
you consume
3. Agile and timely - deploy data processing
infrastructure in minutes, not months and take
advantage of a rich platform of services to respond
quickly to changing business needs
Business outcomes on a modern data architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure
Business case determines platform design
[Diagram: four pipeline stages (ingest/collect, store, process/analyze, consume/visualize) that turn data into answers and insights.]
Start here, with a business case.
Experiment and scale based on your business needs
Match available data: metrics and monitoring, workflow, logs, ERP, transactions
[Diagram: four pipeline stages (ingest/collect, store, process/analyze, consume/visualize) that turn data into answers and insights.]
Why Amazon S3 for modern data architecture?
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multipart upload
• Range GET
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB
• Amazon Athena
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
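The upload and Range GET bullets under "High performance" both rest on splitting an object into byte ranges that can be transferred in parallel. The following is a minimal sketch of that splitting only. The helper name `byte_ranges` and the part size are assumptions for illustration, not taken from the slides, and the code computes HTTP Range header values without calling any AWS API.

```python
from typing import List


def byte_ranges(object_size: int, part_size: int) -> List[str]:
    """Return HTTP Range header values that together cover the whole object.

    object_size and part_size are in bytes. Both names are hypothetical;
    a real client would pass each value as the Range header of a GET request.
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


if __name__ == "__main__":
    for header in byte_ranges(object_size=25, part_size=10):
        print(header)
```

Each range could then be fetched by a separate worker, which is one way a query engine reads only the parts of a file it needs.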
Decouple storage and compute
• Legacy design was large databases or
data warehouses with integrated
hardware
• Big data architectures often benefit
from decoupling storage and compute
Data lake on AWS
• Relational and non-relational data
• Schema defined during analysis
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Run any analytics on the same data without movement
• Scale storage and compute independently
• Store data at $0.023 per GB per month; query for $0.05/GB scanned
[Diagram: Amazon S3 together with AWS Snowball, AWS Snowmobile, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams, Amazon Redshift, Amazon EMR, Amazon Athena, Amazon Kinesis, Amazon Elasticsearch Service, and AI Services.]
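The two prices quoted on this slide can be combined into a rough monthly estimate. This is a sketch under stated assumptions: the storage figure is read as per GB per month (the slide omits the unit), the function name `monthly_cost_usd` is hypothetical, and current AWS pricing may differ from the slide.

```python
# Figures as quoted on the slide; treat them as illustrative, not current pricing.
STORAGE_USD_PER_GB_MONTH = 0.023  # assumption: the slide's "$0.023" is per GB per month
QUERY_USD_PER_GB_SCANNED = 0.05


def monthly_cost_usd(stored_gb: float, scanned_gb: float) -> float:
    """Estimate one month of storage plus query cost from the slide's two prices."""
    if stored_gb < 0 or scanned_gb < 0:
        raise ValueError("sizes must be non-negative")
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    queries = scanned_gb * QUERY_USD_PER_GB_SCANNED
    return storage + queries


if __name__ == "__main__":
    # Example: 1,000 GB stored and 200 GB scanned by queries in a month.
    print(round(monthly_cost_usd(1000, 200), 2))
```

Because storage and query costs are separate terms, each can grow without affecting the other, which is the point of scaling storage and compute independently.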
Big data activation for data-driven companies
Harsh Jetly, Solutions Architect
Looking at big data operations workflow
Copyright 2018 © Qubole
Data teams are getting overrun: increasing workloads, costs, and risks
Petabytes of data, big data infrastructure
• Not enough expertise to go around: 190K unfilled jobs in the US alone
• Manual provisioning makes it impossible to scale
• Exploding data, changing workloads, and new data types overwhelm the data team
• Missed SLAs: data delayed is data denied
• More users want on-demand access to data
Data teams under pressure
Consequence: The Activation Gap
You can’t afford to activate everyone with current economics
[Chart: growth over time of use cases and tools, users and their expectations, volume and variety of data, and data security, shown against the supply of big data skills and IT budget. The difference is labeled "the activation gap".]
Big data can be successful with a modern data lake architecture that:
• scales to allow your data teams and use cases to grow with the company
• provides your teams the ability to collaborate and onboard new projects quickly
• enables your teams to iterate and prototype quickly
The transformational promise of big data: workloads are moving to the cloud
• 58% of big data projects were on the cloud in 2017*
• 73% are running big data projects this year*
*According to a Dimensional Research study
Typical data lake operation
[Diagram summary]
• Data lake storage: raw (staged) data, semi-structured derived data, and the analytics ‘source of truth’, stored as AVRO and PARQUET and maintained with Hive / Spark (insert/update/delete)
• Export (CSV, JSON) to an analytic data warehouse (e.g., Redshift and Snowflake environments) and to data serving DBs (e.g., Cassandra, DynamoDB)
• Cloud compute: Spark, Presto (interactive ad-hoc queries), machine learning (batch + continuous)
• Use cases:
  • Analytics (e.g., product analytics, BI, user insights)
  • Data products (e.g., personalization, recommendation)
  • Data science (e.g., time-series analysis, research)
  • Data discovery (e.g., exploration, lineage, defined tables)
What is the status of your big data initiative?
• Deployed but need to reduce cost/complexity of infrastructure
• Expanding deployments, adding more data, users or workloads
• Initial use case deployed but need help to expand
• Have not deployed big data but researching how to do it
• No intention to deploy big data in the next 12 months
The imperative: Shift to a big data activation strategy
NOW: ACTIVATION GAP | NEXT: FULLY ACTIVATED DATA
Data silos | Shared, governed data access
10% active / 90% inert data | 90% active / 10% inert data
1:10 ops/users, throw bodies at problem | 1:200 ops/users, run on automation + ML
Serviced access to data, tools | Self-service, collaborative access to data, tools
Focus on infrastructure | Focus on business impact
Upside-down speed and economics | Operate with machine-speed economics
Big data activation stack
• Users: data scientists, data engineers, and analysts, each with third-party tools
• Qubole Big Data Cloud Activation Platform: autoscaling, caching, Spot buying, alerts and insights, serverless, …
• Cloud data lake
A deeper look at autoscaling
About the Report
In 2017, 54% of all Amazon EC2 compute hours used were on Spot Instances,
resulting in an estimated $230 million in savings on Amazon EC2 costs.*
Spot instance adoption
*Qubole Big Data Activation Report 2018
• Cluster Lifecycle Management: $150M. Amount saved by automatically terminating a cluster when inactive
• Workload-aware Autoscaling: $121M. Amount saved by predictively adjusting the number of nodes to meet demand
• Spot Shopper: $40M. Amount saved by utilizing Amazon EC2 Spot Instances reliably
How do you deploy big data today?
• On-premises, managing big data software and hardware
• Co-location: a third party manages on-premises big data software and hardware
• In the cloud, managing big data software and cloud infrastructure (EC2, etc.)
• Cloud provider SaaS: multi-tenancy big data service managed by the cloud provider
• Third-party vendor SaaS: multi-tenancy big data service managed by a third-party company
• None of the above
162% growth in open source engine usage globally
• 298% growth in Apache Spark
• 420% growth in Presto
• 102% growth in Apache Hadoop/Hive
Total engine usage globally, by compute hours
• Movement to multi-engine: Companies are increasingly deploying multiple OSS engines for different use cases (ML, ETL, analytics, etc.)
• Users getting more access: More users have access to data and are running more commands and collaborating
• Cloud benefits recognized: Companies are leveraging cloud for rapid innovation and automation to scale
Ashish.mrig@TiVo.com
How is Presto used?
Targeted Audience Delivery
brought to you (in part) by
Why Presto?
• Storage/compute separation
• Easy to add and remove worker nodes
• Query many different data sources (inside our VPC) without a separate load
• Good performance for analytical queries; not so good for transactional and simple queries
• Managed (e.g., Qubole, Starburst)
How Presto works
Data is streamed back to the workers
Lesson learned:
What instance types should we use?
Memory Pools:
• System memory pool (40% of Java heap space)
• Reserved memory pool (largest query’s memory usage)
• General memory pool (the rest of the memory)
• What if memory usage varies a lot between different queries?
• Use many inexpensive instances, or a few expensive instances?
• Compute optimized or memory optimized?
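The three memory pools above can be read as simple arithmetic on a worker's Java heap. The sketch below follows the slide's description only: the 40% system share comes from the slide, while the function name `split_heap`, the field names, and the example sizes are assumptions. In Presto itself these sizes come from configuration, so treat this as an aid for sizing instances rather than a description of Presto internals.

```python
from dataclasses import dataclass

SYSTEM_POOL_FRACTION = 0.40  # from the slide: 40% of Java heap space


@dataclass
class MemoryPools:
    system_gb: float
    reserved_gb: float
    general_gb: float


def split_heap(heap_gb: float, largest_query_gb: float) -> MemoryPools:
    """Split one worker's heap into system, reserved, and general pools.

    largest_query_gb is the per-node memory of the largest expected query;
    the slide sizes the reserved pool to it. Both parameter names are hypothetical.
    """
    system = heap_gb * SYSTEM_POOL_FRACTION
    general = heap_gb - system - largest_query_gb
    if general <= 0:
        raise ValueError("heap too small: nothing is left for the general pool")
    return MemoryPools(system_gb=system, reserved_gb=largest_query_gb, general_gb=general)


if __name__ == "__main__":
    print(split_heap(heap_gb=100, largest_query_gb=20))
```

The arithmetic makes the trade-off visible: if query memory varies a lot, a reserved pool sized for the largest query leaves less general memory for everything else on the same node.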
Working with reserved memory pool
Conceptually, the reserved memory pool should be the “high water mark”, while most queries complete in the general pool.
How do we achieve that?
Solution: multiple clusters based on workload
Empirical testing found that the large instance type was slightly faster
Solution: cost/benefit analysis
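One way to picture "multiple clusters based on workload" is a small router that sends each query to the smallest cluster able to hold it. This is a hypothetical sketch: the cluster names, the memory limits, and the idea of routing on an estimated memory figure are assumptions for illustration, not TiVo's or Qubole's actual mechanism.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    name: str                    # hypothetical cluster label
    max_query_memory_gb: float   # largest query this cluster is sized for


def pick_cluster(clusters: List[Cluster], estimated_memory_gb: float) -> Cluster:
    """Return the smallest cluster whose limit covers the query's estimated memory."""
    eligible = [c for c in clusters if c.max_query_memory_gb >= estimated_memory_gb]
    if not eligible:
        raise ValueError("no cluster is large enough for this query")
    return min(eligible, key=lambda c: c.max_query_memory_gb)


if __name__ == "__main__":
    clusters = [Cluster("adhoc", 10), Cluster("heavy", 100)]
    print(pick_cluster(clusters, 5).name)
    print(pick_cluster(clusters, 50).name)
```

Keeping heavy queries off the ad-hoc cluster means most queries there can complete in the general pool, which matches the intent stated above.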
Choosing the Right Instance Type
Example: r4.4xlarge
• r: instance class
• 4: generation
• 4xlarge: multiplier for CPU and memory
Other examples: t2.2xlarge, c5.16xlarge
Over 100 to choose from!
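The naming scheme on this slide (class, generation, size multiplier) can be split mechanically. The sketch below handles only the simple pattern shown in the slide's three examples; names with extra letters after the generation are outside its scope and raise an error. The names `InstanceType` and `parse_instance_type` are assumptions for illustration.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    instance_class: str  # e.g. "r"
    generation: int      # e.g. 4
    size: str            # e.g. "4xlarge", the multiplier for CPU and memory


# Letters, then digits, then a dot and the size. Covers only the simple form on the slide.
_PATTERN = re.compile(r"^([a-z]+)(\d+)\.(\w+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split a name such as 'r4.4xlarge' into class, generation, and size."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unsupported instance type name: {name!r}")
    instance_class, generation, size = match.groups()
    return InstanceType(instance_class, int(generation), size)


if __name__ == "__main__":
    for example in ("r4.4xlarge", "t2.2xlarge", "c5.16xlarge"):
        print(parse_instance_type(example))
```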
Choosing the Right Instance Type
[Chart comparing instance types. Credit: Willard Simmons (DataXu)]
• Newer instances are more efficient
• Some instance types are better for larger memory clusters
• Some instance types are better for smaller memory clusters
Lesson learned: Elastic Scaling
Average Presto query
[Diagram: one Presto coordinator and two Presto workers running 1 query]
When will queries complete at current rate?
Concurrent Presto queries
[Diagram: one Presto coordinator and two Presto workers running 10 queries]
When will queries complete at current rate? Not fast enough!
More concurrency? Scale up
[Diagram: the coordinator now has four Presto workers for the 10 queries]
Qubole provisions more nodes up to a limit (around 3 minutes)
Back to single Presto query
[Diagram: one Presto coordinator and four Presto workers running 1 query]
When will queries complete at current rate? Too fast!
Scale down
[Diagram: the coordinator is back to two Presto workers for the 1 query]
Qubole decommissions nodes, down to a limit
My big fat Presto query
[Diagram: one Presto coordinator and two Presto workers, both at 100% CPU, running 1 query]
When will queries complete at current rate? Not fast enough!
Upscaling only works for new queries
[Diagram: the two original workers stay at 100% CPU while the two added workers sit idle]
Not so fast… Not fast enough!
Maybe we should have sent this query to a more powerful cluster?
Autoscaling is for concurrency
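The scale-up and scale-down slides turn on one question: when will queries complete at the current rate? A minimal sketch of that decision follows. It is a simplified model, not Qubole's algorithm (the slides do not describe it); the function names, the per-worker rate, the target time, and the worker limits are all assumptions.

```python
import math


def estimated_minutes(queued_queries: int, workers: int, rate_per_worker: float) -> float:
    """Minutes to drain the queue if each worker finishes rate_per_worker queries per minute."""
    return queued_queries / (workers * rate_per_worker)


def next_worker_count(
    queued_queries: int,
    rate_per_worker: float,
    target_minutes: float,
    min_workers: int,
    max_workers: int,
) -> int:
    """Workers needed to drain the queue within target_minutes, clamped to the limits."""
    needed = math.ceil(queued_queries / (rate_per_worker * target_minutes))
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    # 10 queries on 2 workers: not fast enough for a 2-minute target, so scale up to the limit.
    print(estimated_minutes(10, 2, 1.0), next_worker_count(10, 1.0, 2.0, 2, 4))
    # 1 query on 4 workers: faster than needed, so scale down to the minimum.
    print(estimated_minutes(1, 4, 1.0), next_worker_count(1, 1.0, 2.0, 2, 4))
```

The model treats queries as units that can be spread across workers, which is why it helps with concurrency. A single large query that is already running does not move to added workers, so the estimate says nothing useful about it; that is the case the slides suggest sending to a more powerful cluster.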
Results
• Elastic scaling: spin nodes up and down based on demand. Benefit: cost savings
• Specialized clusters: different clusters for different workloads. Benefit: efficiency
• Storage/compute separation: store on Amazon S3, serve using Presto. Benefit: scalability and data availability
Next steps and further information
• Data Lake solution on AWS:
https://aws.amazon.com/big-data/data-lake-on-aws/
• Get started with Qubole:
https://aws.amazon.com/quickstart/architecture/qubole-on-data-lake-foundation/
• Try AWS for free:
https://aws.amazon.com/
Q & A
Thank you!

Mais conteúdo relacionado

Mais procurados

Managing Microsoft Workloads on AWS.pdf
Managing Microsoft Workloads on AWS.pdfManaging Microsoft Workloads on AWS.pdf
Managing Microsoft Workloads on AWS.pdfAmazon Web Services
 
AWSome Day Iceland - Technical Track
AWSome Day Iceland - Technical TrackAWSome Day Iceland - Technical Track
AWSome Day Iceland - Technical TrackAmazon Web Services
 
Protecting Your Data- AWS Security Tools and Features
Protecting Your Data- AWS Security Tools and FeaturesProtecting Your Data- AWS Security Tools and Features
Protecting Your Data- AWS Security Tools and FeaturesAmazon Web Services
 
Enabling Compliance with the GDPR on AWS
Enabling Compliance with the GDPR on AWSEnabling Compliance with the GDPR on AWS
Enabling Compliance with the GDPR on AWSAmazon Web Services
 
SID303 Navigating GDPR Compliance on AWS
 SID303 Navigating GDPR Compliance on AWS SID303 Navigating GDPR Compliance on AWS
SID303 Navigating GDPR Compliance on AWSAmazon Web Services
 
Managed Relational Databases - Amazon RDS
Managed Relational Databases - Amazon RDSManaged Relational Databases - Amazon RDS
Managed Relational Databases - Amazon RDSAmazon Web Services
 
AWSome Day Geneva Main Track: Infrastructure Part 1.pdf
AWSome Day Geneva Main Track: Infrastructure Part 1.pdfAWSome Day Geneva Main Track: Infrastructure Part 1.pdf
AWSome Day Geneva Main Track: Infrastructure Part 1.pdfAmazon Web Services
 
SRV302 Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway
 SRV302 Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway SRV302 Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway
SRV302 Deep Dive: Hybrid Cloud Storage with AWS Storage GatewayAmazon Web Services
 
Trimono Gains Reliable Backup and Recovery on AWS with Veritas
 Trimono Gains Reliable Backup and Recovery on AWS with Veritas Trimono Gains Reliable Backup and Recovery on AWS with Veritas
Trimono Gains Reliable Backup and Recovery on AWS with VeritasAmazon Web Services
 
Migrate & Modernize your legacy Microsoft applications with AWS
Migrate & Modernize your legacy Microsoft applications with AWSMigrate & Modernize your legacy Microsoft applications with AWS
Migrate & Modernize your legacy Microsoft applications with AWSAmazon Web Services
 
Introduction to the Security Perspective of the Cloud Adoption Framework
Introduction to the Security Perspective of the Cloud Adoption FrameworkIntroduction to the Security Perspective of the Cloud Adoption Framework
Introduction to the Security Perspective of the Cloud Adoption FrameworkAmazon Web Services
 
Analisi avanzata di video e immagini con i servizi AI di AWS
Analisi avanzata di video e immagini con i servizi AI di AWSAnalisi avanzata di video e immagini con i servizi AI di AWS
Analisi avanzata di video e immagini con i servizi AI di AWSAmazon Web Services
 
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...Amazon Web Services
 
How Symantec Cloud Workload Protection Secures LifeLock on AWS PPT
 How Symantec Cloud Workload Protection Secures LifeLock on AWS PPT How Symantec Cloud Workload Protection Secures LifeLock on AWS PPT
How Symantec Cloud Workload Protection Secures LifeLock on AWS PPTAmazon Web Services
 
A Practitioner's Guide to Securing Your Cloud (Like an Expert) (SEC203-R1) - ...
A Practitioner's Guide to Securing Your Cloud (Like an Expert) (SEC203-R1) - ...A Practitioner's Guide to Securing Your Cloud (Like an Expert) (SEC203-R1) - ...
A Practitioner's Guide to Securing Your Cloud (Like an Expert) (SEC203-R1) - ...Amazon Web Services
 

Mais procurados (20)

Managing Microsoft Workloads on AWS.pdf
Managing Microsoft Workloads on AWS.pdfManaging Microsoft Workloads on AWS.pdf
Managing Microsoft Workloads on AWS.pdf
 
AWSome Day Iceland - Technical Track
AWSome Day Iceland - Technical TrackAWSome Day Iceland - Technical Track
AWSome Day Iceland - Technical Track
 
AWS Storage Stage of Union
AWS Storage Stage of UnionAWS Storage Stage of Union
AWS Storage Stage of Union
 
Protecting Your Data- AWS Security Tools and Features
Protecting Your Data- AWS Security Tools and FeaturesProtecting Your Data- AWS Security Tools and Features
Protecting Your Data- AWS Security Tools and Features
 
Enabling Compliance with the GDPR on AWS
Enabling Compliance with the GDPR on AWSEnabling Compliance with the GDPR on AWS
Enabling Compliance with the GDPR on AWS
 
SID303 Navigating GDPR Compliance on AWS
 SID303 Navigating GDPR Compliance on AWS SID303 Navigating GDPR Compliance on AWS
SID303 Navigating GDPR Compliance on AWS
 
Managed Relational Databases - Amazon RDS
Managed Relational Databases - Amazon RDSManaged Relational Databases - Amazon RDS
Managed Relational Databases - Amazon RDS
 
AWSome Day Geneva Main Track: Infrastructure Part 1.pdf
AWSome Day Geneva Main Track: Infrastructure Part 1.pdfAWSome Day Geneva Main Track: Infrastructure Part 1.pdf
AWSome Day Geneva Main Track: Infrastructure Part 1.pdf
 
SRV302 Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway
 SRV302 Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway SRV302 Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway
SRV302 Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway
 
Trimono Gains Reliable Backup and Recovery on AWS with Veritas
 Trimono Gains Reliable Backup and Recovery on AWS with Veritas Trimono Gains Reliable Backup and Recovery on AWS with Veritas
Trimono Gains Reliable Backup and Recovery on AWS with Veritas
 
AWS Espressif Amazon FreeRTOS
AWS Espressif Amazon FreeRTOSAWS Espressif Amazon FreeRTOS
AWS Espressif Amazon FreeRTOS
 
Migrate & Modernize your legacy Microsoft applications with AWS
Migrate & Modernize your legacy Microsoft applications with AWSMigrate & Modernize your legacy Microsoft applications with AWS
Migrate & Modernize your legacy Microsoft applications with AWS
 
Introduction to the Security Perspective of the Cloud Adoption Framework
Introduction to the Security Perspective of the Cloud Adoption FrameworkIntroduction to the Security Perspective of the Cloud Adoption Framework
Introduction to the Security Perspective of the Cloud Adoption Framework
 
Analisi avanzata di video e immagini con i servizi AI di AWS
Analisi avanzata di video e immagini con i servizi AI di AWSAnalisi avanzata di video e immagini con i servizi AI di AWS
Analisi avanzata di video e immagini con i servizi AI di AWS
 
Tape Replacement
Tape ReplacementTape Replacement
Tape Replacement
 
Amazon Container Services
Amazon Container ServicesAmazon Container Services
Amazon Container Services
 
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
How UCSD Simplified Data Protection with Rubrik and AWS (STG207-S) - AWS re:I...
 
Getting Started with AWS
Getting Started with AWSGetting Started with AWS
Getting Started with AWS
 
How Symantec Cloud Workload Protection Secures LifeLock on AWS PPT
 How Symantec Cloud Workload Protection Secures LifeLock on AWS PPT How Symantec Cloud Workload Protection Secures LifeLock on AWS PPT
How Symantec Cloud Workload Protection Secures LifeLock on AWS PPT
 
A Practitioner's Guide to Securing Your Cloud (Like an Expert) (SEC203-R1) - ...
A Practitioner's Guide to Securing Your Cloud (Like an Expert) (SEC203-R1) - ...A Practitioner's Guide to Securing Your Cloud (Like an Expert) (SEC203-R1) - ...
A Practitioner's Guide to Securing Your Cloud (Like an Expert) (SEC203-R1) - ...
 

Semelhante a TiVo: How to Scale New Products with a Data Lake on AWS and Qubole

Fanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSFanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSAmazon Web Services
 
Automating Big Data Technologies for Faster Time-to-Value
 Automating Big Data Technologies for Faster Time-to-Value Automating Big Data Technologies for Faster Time-to-Value
Automating Big Data Technologies for Faster Time-to-ValueAmazon Web Services
 
Architecting an Open Data Lake for the Enterprise
 Architecting an Open Data Lake for the Enterprise  Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise Amazon Web Services
 
McGraw-Hill Optimizes Analytics Workloads with Databricks
 McGraw-Hill Optimizes Analytics Workloads with Databricks McGraw-Hill Optimizes Analytics Workloads with Databricks
McGraw-Hill Optimizes Analytics Workloads with DatabricksAmazon Web Services
 
How Different Large Organizations are Approaching Cloud Adoption
How Different Large Organizations are Approaching Cloud AdoptionHow Different Large Organizations are Approaching Cloud Adoption
How Different Large Organizations are Approaching Cloud AdoptionAmazon Web Services
 
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseAmazon Web Services
 
100 Billion Data Points With Lambda_AWSPSSummit_Singapore
100 Billion Data Points With Lambda_AWSPSSummit_Singapore100 Billion Data Points With Lambda_AWSPSSummit_Singapore
100 Billion Data Points With Lambda_AWSPSSummit_SingaporeAmazon Web Services
 
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)Amazon Web Services
 
Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Holden Ackerman
 
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTHow TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTAmazon Web Services
 
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1
AWS Data-Driven Insights Learning Series ANZ Sep 2019 Part 1Amazon Web Services
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Amazon Web Services
 
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 Citrix Moves Data to Amazon Redshift Fast with Matillion ETL Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
Citrix Moves Data to Amazon Redshift Fast with Matillion ETLAmazon Web Services
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Amazon Web Services
 
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Amazon Web Services
 
Accelerate Database Migration to AWS with DB Best
 Accelerate Database Migration to AWS with DB Best Accelerate Database Migration to AWS with DB Best
Accelerate Database Migration to AWS with DB BestAmazon Web Services
 
BI & Analytics - A Datalake on AWS
BI & Analytics - A Datalake on AWSBI & Analytics - A Datalake on AWS
BI & Analytics - A Datalake on AWSAmazon Web Services
 
GPSWKS301_Comprehensive Big Data Architecture Made Easy
GPSWKS301_Comprehensive Big Data Architecture Made EasyGPSWKS301_Comprehensive Big Data Architecture Made Easy
GPSWKS301_Comprehensive Big Data Architecture Made EasyAmazon Web Services
 
Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...
Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...
Comprehensive Big Data Analytics Architecture Made Easy - The AWS Marketplace...Amazon Web Services
 

Semelhante a TiVo: How to Scale New Products with a Data Lake on AWS and Qubole (20)

Fanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSFanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWS
 
Automating Big Data Technologies for Faster Time-to-Value
 Automating Big Data Technologies for Faster Time-to-Value Automating Big Data Technologies for Faster Time-to-Value
Automating Big Data Technologies for Faster Time-to-Value
 
Architecting an Open Data Lake for the Enterprise
 Architecting an Open Data Lake for the Enterprise  Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
McGraw-Hill Optimizes Analytics Workloads with Databricks
 McGraw-Hill Optimizes Analytics Workloads with Databricks McGraw-Hill Optimizes Analytics Workloads with Databricks
McGraw-Hill Optimizes Analytics Workloads with Databricks
 
How Different Large Organizations are Approaching Cloud Adoption
How Different Large Organizations are Approaching Cloud AdoptionHow Different Large Organizations are Approaching Cloud Adoption
How Different Large Organizations are Approaching Cloud Adoption
 
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
100 Billion Data Points With Lambda_AWSPSSummit_Singapore
100 Billion Data Points With Lambda_AWSPSSummit_Singapore100 Billion Data Points With Lambda_AWSPSSummit_Singapore
100 Billion Data Points With Lambda_AWSPSSummit_Singapore
 
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
雲上打造資料湖 (Data Lake):智能化駕馭商機 (Level 300)
 
Top Trends in Building Data Lakes for Machine Learning and AI



TiVo: How to Scale New Products with a Data Lake on AWS and Qubole

  • 1. June 7, 2018 | 10:00 AM PT. TiVo: How to scale new products with a data lake on AWS and Qubole
  • 2. Today’s presenters: Paul Sears, Partner Solutions Architect, Amazon Web Services; Harsh Jetly, Solutions Architect, Qubole; Ashish Mrig, Senior Manager, Big Data Analytics, TiVo
  • 3. Today’s agenda: 1. An overview of Amazon Web Services (AWS) with an emphasis on AWS data lake solutions and Qubole 2. Overview of the Qubole solutions featured in our story 3. Challenges faced by TiVo 4. The TiVo success story with AWS and Qubole 5. Q&A/Discussion
  • 4. Learning objectives: 1. How to dramatically reduce management complexities for big data analytics operations on AWS 2. Best practices for optimizing data lakes for self-service analytics that enable teams to productionize data science and accelerate data pipelines 3. Using Presto with Qubole’s auto-scaling management and Spot Instance Bidding to reduce the complexity, cost, and deployment time of big data projects
  • 5. The data lake and AWS: Drive business value with any type of data
  • 6. Legacy data warehouses and RDBMS: • Complex to set up and manage • Do not scale • Take months to add new data sources • Queries take too long • Cost $MM upfront
  • 7. Should I build a data lake? Starting by amassing "all your data" and dumping into a large repository for the data gurus to start finding "insights" is like trying to win the lottery by buying all the tickets.
  • 8. Rethink how to become a data-driven business: 1. Business outcomes - start with the insights and actions you want to drive, then work backwards to a streamlined design 2. Experimentation - start small, test many ideas, keep the good ones and scale those up, paying only for what you consume 3. Agile and timely - deploy data processing infrastructure in minutes, not months, and take advantage of a rich platform of services to respond quickly to changing business needs
  • 9. Business outcomes on a modern data architecture. Outcome 1: Modernize and consolidate • Insights to enhance business applications and create new digital services. Outcome 2: Innovate for new revenues • Personalization, demand forecasting, risk analysis. Outcome 3: Real-time engagement • Interactive customer experience, event-driven automation, fraud detection. Outcome 4: Automate for expansive reach • Automation of business processes and physical infrastructure
  • 10. Business case determines platform design. [Diagram labels: Data; Ingest/Collect; Store; Process/Analyze; Consume/Visualize; Answers & Insights. Callout: Start here with a business case.]
  • 11. Experiment and scale based on your business needs. [Diagram labels: Data; Ingest/Collect; Store; Process/Analyze; Consume/Visualize; Answers & Insights. Callout: Match available data (Metrics and Monitoring, Workflow Logs, ERP, Transactions).]
  • 12. Why Amazon S3 for modern data architecture? Durable: designed for 11 9s of durability. Available: designed for 99.99% availability. High performance: multiple upload; Range GET. Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments. Integrated: Amazon EMR; Amazon Redshift; Amazon DynamoDB; Amazon Athena. Easy to use: simple REST API; AWS SDKs; read-after-create consistency; event notification; lifecycle policies.
  • 13. Decouple storage and compute • Legacy design was large databases or data warehouses with integrated hardware • Big data architectures often benefit from decoupling storage and compute
  • 14. Data lake on AWS. [Diagram labels: AWS Snowball; AWS Snowmobile; Amazon Kinesis Data Firehose; Amazon Kinesis Data Streams; S3; Amazon Redshift; Amazon EMR; Amazon Athena; Amazon Kinesis; Amazon Elasticsearch Service; Amazon Kinesis Video Streams; AI Services.] • Relational and non-relational data • Schema defined during analysis • Unmatched durability and availability at EB scale • Best security, compliance, and audit capabilities • Run any analytics on the same data without movement • Scale storage and compute independently • Store data at $0.023 / month; Query for $0.05/GB scanned
  • 15. Big data activation for data-driven companies. Harsh Jetly, Solutions Architect. Looking at big data operations workflow
  • 16. Copyright 2018 © Qubole. Data teams are getting overrun: increasing workloads, costs and risks. [Diagram labels: Petabytes of Data; Big Data Infrastructure; Data teams under pressure.] • Not enough expertise to go around: 190K unfilled jobs in US alone • Manual provisioning makes it impossible to scale • Exploding data, changing workloads, new data types overwhelm data team • Missed SLAs: data delayed is data denied • More users want on-demand access to data
  • 17. Consequence: The Activation Gap. You can’t afford to activate everyone with current economics. [Chart labels: Growth; Time; Use cases and tools; Users and their expectations; Volume and variety of data; Data security; Supply of big data skills; IT budget; THE ACTIVATION GAP.]
  • 18. Big data can be successful with a modern data lake architecture that: • scales to allow your data teams and use cases to grow with the company • enables your teams to iterate and prototype quickly • provides your teams the ability to collaborate and onboard new projects quickly
  • 19. The transformational promise of big data: workloads are moving to the cloud. 58% of big data projects were on the cloud in 2017*. 73% are running big data projects this year*. *According to a Dimensional Research study
  • 20. Typical data lake operation. [Diagram labels. Data lake storage: Raw (staged), semi-structured; Derived analytics, ‘source of truth’; formats AVRO, PARQUET, CSV, JSON; Hive / Spark; Insert/Update/Delete; Export. Downstream: Analytic data warehouse (i.e. Redshift & Snowflake environments); Data serving DBs (i.e. Cassandra, DynamoDB, etc.). Cloud compute: SPARK; PRESTO; interactive ad-hoc queries. Use cases: Analytics (i.e. Product Analytics, BI, User insights etc.); Data Products (i.e. Personalization, Recommendation etc.); Data Science (i.e. Time-series Analysis, Research etc.); Data Discovery (i.e. Exploration, Lineage, Defined Tables); Machine Learning (batch + continuous).]
  • 21. What is the status of your big data initiative? • Deployed but need to reduce cost/complexity of infrastructure • Expanding deployments, adding more data, users or workloads • Initial use case deployed but need help to expand • Have not deployed big data but researching how to do it • No intention to deploy big data in the next 12 months
  • 22. The imperative: Shift to a big data activation strategy. From NOW (activation gap) to NEXT (fully activated data): • Data silos to shared, governed data access • 10% active / 90% inert data to 90% active / 10% inert data • 1:10 ops/users, throw bodies at problem, to 1:200 ops/users, run on automation + ML • Serviced access to data, tools to self-service, collaborative access to data, tools • Focus on infrastructure to focus on business impact • Upside-down speed and economics to operating with machine-speed economics
  • 23. Big data activation stack. [Diagram labels: Data Scientists, Data Engineers, Analysts, each with Third-Party Tools; Qubole Big Data Cloud Activation Platform: Autoscaling, Caching, Spot buying, Alerts & Insights, Serverless, …; Cloud Data Lake.]
  • 24. A deeper look at autoscaling
  • 25. Spot instance adoption. About the report: In 2017, 54% of all Amazon EC2 compute hours used were Spot Instances, resulting in an estimated $230 million in savings of Amazon EC2 costs.* *Qubole Big Data Activation Report 2018
  • 26. Cluster Life Cycle Management: $150M. Workload-aware Autoscaling: $121M. Spot Shopper: $40M. Cluster Lifecycle Savings: amount saved by automatically terminating a cluster when inactive. Workload-aware Autoscaling Savings: amount saved by predictively adjusting the number of nodes to meet demand. Spot Shopper Savings: amount saved by utilizing Amazon EC2 Spot Instances reliably
  • 27. How do you deploy big data today? • On-premises managing big data software and hardware • Co-location. 3rd party manages on-premises big data • In the cloud. You manage big data and cloud infrastructure • Cloud SaaS. Multi-tenancy big data service from cloud provider • SaaS vendor. Multi-tenancy big data service from 3rd party
  • 28. How do you deploy big data today? • On-premises managing big data software and hardware • Co-location. 3rd party manages on-premises big data software and hardware • In the cloud managing big data software and cloud infrastructure (EC2, etc.) • Cloud provider SaaS. Multi-tenancy big data service managed by cloud provider • 3rd party vendor SaaS. Multi-tenancy big data service managed by 3rd party company • None of the above
  • 29. 162% growth in open source engine usage globally (total engine usage globally by compute hours): • 298% growth in Apache Spark • 420% growth in Presto • 102% growth in Apache Hadoop/Hive
  • 30. Movement to multi-engine: companies are increasingly deploying multiple OSS engines for different use cases (ML, ETL, analytics, etc.). Users getting more access: more users have access to data and are running more commands and collaborating. Cloud benefits recognized: companies are leveraging cloud for rapid innovation and automation to scale
  • 32. How is Presto used? Targeted Audience Delivery
  • 33. Targeted Audience Delivery brought to you (in part) by
  • 34. Why Presto? • Storage/compute separation • Easy to add and remove worker nodes • Query many different data sources (inside our VPC) without a separate load • Good performance for analytical queries. Not so good for transactional and simple queries… • Managed (e.g., Qubole, Starburst)
  • 35. How Presto works: Data is streamed back to the workers
  • 36. Lesson learned: What instance types should we use?
  • 37. Memory Pools: • System memory pool (40% of Java heap space) • Reserved memory pool (largest query’s memory usage) • General memory pool (the rest of the memory)
  • 38. Working with reserved memory pool. Conceptually, the reserved memory pool should be the “high water mark” while most queries complete in the general pool. How do we achieve that? • What if memory usage varies a lot between different queries? • Use many inexpensive instances, or a few expensive instances? • Compute optimized or memory optimized? Findings: • Solution: multiple clusters based on workload • Empirical testing found the large instance type was slightly faster • Solution: cost/benefit analysis
  • 39. Choosing the Right Instance Type. Example: r4.4xlarge = instance class (r), generation (4), multiplier for CPU and memory (4xlarge). Other examples: t2.2xlarge, c5.16xlarge. Over 100 to choose from!
  • 40. Choosing the Right Instance Type Credit: Willard Simmons (DataXu)
  • 41. Choosing the Right Instance Type Newer instances are more efficient Credit: Willard Simmons (DataXu)
  • 42. Better for larger memory clusters Newer instances are more efficient Credit: Willard Simmons (DataXu) Choosing the Right Instance Type
  • 43. Better for smaller memory clusters Newer instances are more efficient Credit: Willard Simmons (DataXu) Choosing the Right Instance Type
  • 45. Average Presto query. [Diagram: Presto Coordinator with two Presto Workers; 1 query.] When will queries complete at current rate?
  • 46. Concurrent Presto queries. [Diagram: Presto Coordinator with two Presto Workers; 10 queries.] When will queries complete at current rate? Not fast enough!
  • 47. More concurrency? Scale up. [Diagram: Presto Coordinator with four Presto Workers; 10 queries.] When will queries complete at current rate? Qubole provisions more nodes up to a limit (around 3 minutes).
  • 48. Back to single Presto query. [Diagram: Presto Coordinator with four Presto Workers; 1 query.] When will queries complete at current rate? Too fast!
  • 49. Scale down. [Diagram: Presto Coordinator with two Presto Workers; 1 query.] When will queries complete at current rate? Qubole decommissions nodes up to a limit.
  • 50. My big fat Presto query. [Diagram: Presto Coordinator with two Presto Workers, each at 100% CPU; 1 query.] When will queries complete at current rate? Not fast enough!
  • 51. Autoscaling is for concurrency. [Diagram: Presto Coordinator with four Presto Workers, two at 100% CPU and two idle; 1 query.] When will queries complete at current rate? Not fast enough! Not so fast… Upscaling only works for new queries. Maybe we should have sent this query to a more powerful cluster?
  • 52. Results. • Elastic scaling: spin the nodes up/down based on demand. Benefit: cost savings. • Specialized clusters: different clusters for different workloads. Benefit: efficiency. • Storage/compute separation: store on Amazon S3, serve using Presto. Benefit: scalability and data availability.
  • 53. Next steps and further information • Data Lake solution on AWS: https://aws.amazon.com/big-data/data-lake-on-aws/ • Get started with Qubole: https://aws.amazon.com/quickstart/architecture/qubole-on-data-lake-foundation/ • Try AWS for free: https://aws.amazon.com/
  • 54. Q & A
  • 55. Thank you!
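
Slide 37 splits a Presto worker's Java heap into three pools. The arithmetic can be sketched as below. This is only an illustration of the slide's description: the function name and parameters are hypothetical, the 40% system share is the figure quoted on the slide, and the reserved pool is sized to the largest query's memory usage as the slide states.

```python
def presto_memory_pools(heap_gb: float, largest_query_gb: float,
                        system_fraction: float = 0.40) -> dict:
    """Split a worker heap into system, reserved and general pools.

    Hypothetical helper following slide 37:
    system   = 40% of the Java heap
    reserved = memory used by the largest query
    general  = whatever is left
    """
    system = heap_gb * system_fraction
    reserved = largest_query_gb
    general = heap_gb - system - reserved
    if general < 0:
        raise ValueError("largest query does not fit next to the system pool")
    return {"system": system, "reserved": reserved, "general": general}


if __name__ == "__main__":
    # Example numbers are made up for illustration.
    print(presto_memory_pools(heap_gb=100, largest_query_gb=20))
```

A result with a very small general pool suggests the "high water mark" query is crowding out everyday queries, which is the situation slide 38 addresses with separate clusters per workload.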
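
Slide 39 breaks an instance name such as r4.4xlarge into instance class, generation, and a size multiplier. A minimal parser for that naming scheme might look like the following. The `InstanceType` tuple and `parse_instance_type` function are hypothetical names, and the pattern only covers the simple `<class><generation>.<size>` form shown on the slide, not names with extra suffix letters.

```python
import re
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    instance_class: str        # e.g. "r"
    generation: int            # e.g. 4
    size: str                  # e.g. "4xlarge"
    multiplier: Optional[int]  # 4 for "4xlarge", 1 for "xlarge", None otherwise


# <letters><digits>.<size>; the size is either "[N]xlarge" or a plain word.
_NAME = re.compile(r"^([a-z]+)(\d+)\.((\d*)xlarge|[a-z]+)$")


def parse_instance_type(name: str) -> InstanceType:
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    instance_class, generation, size, factor = match.groups()
    if factor is None:
        multiplier = None          # sizes such as "large" or "micro"
    else:
        multiplier = int(factor) if factor else 1
    return InstanceType(instance_class, int(generation), size, multiplier)


if __name__ == "__main__":
    for example in ("r4.4xlarge", "t2.2xlarge", "c5.16xlarge"):
        print(parse_instance_type(example))
```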
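
Slides 45 to 51 describe the autoscaling loop in terms of one question: when will queries complete at the current rate? The decision can be sketched roughly as follows. This is not Qubole's actual algorithm; the function, its parameters, and the `slack` threshold are assumptions made for illustration.

```python
def scaling_decision(estimated_completion_s: float, target_s: float,
                     current_nodes: int, min_nodes: int, max_nodes: int,
                     slack: float = 0.5) -> str:
    """Return "scale_up", "scale_down" or "hold".

    Hypothetical rule of thumb:
    - not fast enough (estimate beyond target): add nodes, up to max_nodes
    - too fast (estimate well under target): remove nodes, down to min_nodes
    """
    if estimated_completion_s > target_s and current_nodes < max_nodes:
        return "scale_up"
    if estimated_completion_s < target_s * slack and current_nodes > min_nodes:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    # Ten concurrent queries on two workers: not fast enough.
    print(scaling_decision(900, 300, current_nodes=2, min_nodes=2, max_nodes=10))
    # Back to a single query on four workers: too fast.
    print(scaling_decision(60, 300, current_nodes=4, min_nodes=2, max_nodes=10))
```

As slide 51 notes, added nodes only help queries that start after they join, so a rule like this addresses concurrency rather than one oversized query; that case is better routed to a more powerful cluster.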