SlideShare uma empresa Scribd logo
1 de 38
HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA
SQL ANALYTICS PROCESSING IN THE CLOUD
A REAL WORLD CASE STUDY
Jaipaul Agonus
FINRA
Strata Hadoop World
New York, Sep 2015
FINRA - WHAT DO WE DO?
Collect and Create
• Up to 75 billion events per day
• 13 National Exchanges, 5 Reporting Facilities
• Reconstruct the market from trillions of events spanning years
Detect & Investigate
• Identify market manipulations, insider trading, fraud and compliance
violations
Enforce & Discipline
• Ensure rule compliance
• Fine and bar broker dealers
• Refer matters to the SEC and other authorities
2
FINRA’S SURVEILLANCE ALGORITHMS
Hundreds of surveillance algorithms against massive amounts of data in
multiple products (Equities, Options, etc.) and across multiple
exchanges (NASDAQ, NYSE, CBOE, etc.).
3
Best Execution
Layering
Compliance
Detecting abusive activity and compliance breaches
FINRA’S SURVEILLANCE ALGORITHMS
4
FINRA’S SURVEILLANCE ALGORITHMS
Dealing with big data before there was “Big Data”.
Over 430 batch analytics in the Surveillance suite.
“Massively Parallel Processing” methodology used to solve big-data
problems in the legacy world.
5
PRE-HADOOP DATA ARCHITECTURE
 Tiered storage design that struggles to balance Cost, Performance and Flexibility
6
PRE-HADOOP PAIN POINTS
Data Silos
Data distributed physically across MPP appliances, NAS and Tapes,
affecting accessibility and efficiency.
7
PRE-HADOOP PAIN POINTS
Cost
Expensive, specialized hardware tuned for CPU, storage and network
performance.
Proprietary software that comes with relative high cost & vendor lock-in.
8
PRE-HADOOP PAIN POINTS
Non-Elasticity
Can't grow or shrink easily with data volume, bound by the cost of
hardware in the appliance and the relative high cost of the software.
9
ADDRESSING PAIN POINTS WITH..
HIVE
De facto standard for
SQL-on-Hadoop
Amazon EMR
Amazon Elastic Map
Reduce – Managed
Hadoop Framework
Amazon S3
Amazon Simple
Storage Service With
Practically Infinite
Storage
10
WHY SQL?
Heavily SQL based legacy application that works on source data
that’s already available in structured format.
Hundreds of thousands of lines of legacy SQL code, iterated and
tested rigorously through the years and readily available for easy
Hive porting.
Developers, testers, data scientists and analysts with deep SQL
skills and necessary business acumen already part of FINRA
culture.
11
HIVE
Developed at Facebook and now open-source de facto SQL standard for Hadoop.
Been around for seven years, battle tested at scale, and widely used across industries.
Powerful abstraction over MapReduce jobs, translates HiveQL into map-reduce jobs that
work against distributed dataset.
12
HIVE – EXECUTION ENGINES
MapReduce
Mature batch-processing platform for the petabyte scale!
Does not perform well enough for small data or iterative calculations with
long data pipelines.
Tez
Aims to translate complex SQL statements into an optimized, purpose-
built data processing graphs.
Strikes a balance between performance, throughput, and scalability.
Spark
Fast in-memory computing, allows you to leverage available memory by
fitting in all intermediate data.
13
AMAZON(AWS) S3
Cost-effective and durable object storage for a variety of content.
Allows separation of storage from compute resources, providing ability to scale each
independently in pay-as-you-go pricing model.
Meets Hadoop’s file system requirements and integrates well with EMR.
14
EMR – ELASTIC MAP REDUCE
Managed hadoop framework, easy to deploy and manage Hadoop clusters.
Easy, fast, and cost-effective way to distribute and process vast amounts of data
across dynamically scalable Amazon EC2 Instances.
15
EMR INSTANCE TYPES
A wide selection of virtual Linux servers with varying combinations of
CPU, memory, storage, and networking capacity.
Each instance type includes one or more instance sizes, allowing you to
scale your resources to target workload.
Available in Spot (your bid price) or On-demand model.
[Source – Amazon] 16
CLOUD BASED HADOOP ARCHITECTURE
17
UNDERLYING DESIGN PATTERN
[Source – Amazon]
Cluster lives just for the duration of the
job, shutdown when the job is done.
Persist input and output data in S3.
Run multiple jobs in multiple Amazon
EMR clusters over the same data set (in
S3) without overloading your HDFS
nodes.
Transient Clusters with S3 as HDFS
18
UNDERLYING DESIGN PATTERN – BENEFITS
[Source – Amazon]
Control your cost, pay for what you use.
Minimum maintenance, cluster goes away when the job is done.
Persisting data in S3 enables easy reprocessing if the spot Instances are
taken away due to an outbid.
19
LESSON LEARNED
Design your Hive batch analytics with focus on enabling direct data
access and maximizing resource utilization.
20
HIVE – DESIGNING FOR PERFORMANCE
Enable Direct Data Access
Partition data along natural query boundaries (e.g. trade date) and process only
the data you need.
Improve join performance by bucketing and sorting data ahead of time and
reduce I/O scans during join process.
Use broadcast joins when joining small tables.
21
HIVE – DESIGNING FOR PERFORMANCE
Tune Hive Configurations
Increase parallelism by tuning MapReduce split size to use all the available map
slots in the cluster.
Increase the replication factor on dimension tables (not for additional resiliency
but for better performance).
Compress intermediate data and reduce data transfer between
mappers/reducers.
22
LESSON LEARNED
Measure and profile your clusters from the outset and adjust
continuously keeping pace with changing data size, processing flow
and execution frameworks.
23
CLUSTER PROFILING - SAMPLE
Batch Process:
• Order Audit Trail Report Card Validation
Source Dataset:
• 1 month of market orders data
• Over 100 billion rows
• Around 5 terabytes in size
• Input stored in S3
Instance Choices:
• m1.xlarge (General Purpose) - 4 CPUs, 15 GB RAM, 4 x 420 Disk Storage, High network performance
• c3.8xlarge (Compute Optimized) - 32 CPUs, 60 GB RAM, 2 x 320 SSD Storage, High network performance
Run Options:
Attribute Option-1 Option-2 Option-3 Option-4
Instance Type m1.xl C3.8xl C3.8xl C3.8xl
Instance Classification General Purpose Compute Optimized Compute Optimized Compute Optimized
Cluster Size 100 16 32 100
24
CLUSTER PROFILE – COMPARISON RESULTS
100 M1s Vs 16 C3s Vs 32 C3s Vs 100 C3s
Attribute Option-1 Option-2 Option-3 Option-4
Instance Type m1.xl C3.8xl C3.8xl C3.8xl
Instance Classification General Purpose Compute Optimized Compute Optimized Compute Optimized
Cluster Size 100 16 32 100
Cost $57.33 $111.86 $127.05 $312.00
Time 3 Hrs 12 mins 8 Hrs 4 Hrs 56 mins 3 Hrs
Approximate Spot Price per Instance $0.14 $0.78 $0.66 $0.78
Peak Disk Usage 0.04% 2.10% 1.20% 0.45%
Peak Memory Usage 56% 43% 45% 41%
Peak CPU Usage 100% 95% 98% 92%
25
LESSON LEARNED
Abstract execution framework (MR, Tez, Spark) from business logic
in your app architecture. This allows switching frameworks, keeping
up with Hive community.
26
LESSON LEARNED
Hive UDFs (User Defined Functions) can solve your complex
problems when SQL falls short.
27
HIVE UDFS
Used when Hive functionality falls short
• e.g. window function with ignore nulls – not supported in Hive
• e.g. date formatting functions – Hive 1.2 has better support
Used when non-procedural SQL can’t accomplish the task
• e.g. de-dupe many-to-many time series pairings
A->B 10AM, A->C 11AM, D->B 11AM, D->C 11AM, E->C 12PM
28
LESSON LEARNED
Choose an optimal storage format (row, columnar) and compression
type (splittable, space efficient, fast).
29
FILE STORAGE – FORMAT AND COMPRESSION
[Source – Amazon]
STORAGE FORMATS
Columnar or Row based storage
RCFILE, ORC, PARQUET – Columnar formats, skip loading unwanted
columns!
COMPRESSION TYPES
Reduce the number of bytes written to/read from HDFS
Some are fast but offer less space reduction, some are space efficient but
slower, some are splittable and some are not.
30
Compression Extension Splittable Encoding/Decoding
Speed (Scale 1-4)
Space Savings %
(scale 1-4)
Gzip gz no 1 4
lzo lzo yes if indexed 2 2
bzip2 bz2 yes 3 3
snappy snappy no 4 1
LESSON LEARNED
Being at the bleeding edge of sql-on-hadoop has its risks, you need
the right strategy to mitigate risks/challenges along the way.
31
RISKS AND MITIGATION STRATEGIES
Hive, released in late 2008, Tez/Spark backend added recently.
Traditional platforms like Oracle tested and iterated through four decades.
Discovered issues with Hadoop/Hive during migration that are related to
compression, storage types and memory management.
Performance issues with interactive analytics – Impacts Data Scientists & Business
Analysts.
32
RISKS AND MITIGATION STRATEGIES
Extensive parallel run comparison (apples-to-apples) against legacy production
data to identify issues before production roll out.
Partnered with Hadoop and Cloud vendors to address Hadoop/Hive issues and
feature requests quickly.
Push of a button automated regression test suites to kick the tires and test
functionality for any minor software increments.
Analyzing Presto/Tez to solve interactive analytics problems.
33
RISKS AND MITIGATION STRATEGIES
Took advantage of cloud elasticity to complete parallel runs
against production volume at scale to produce results swiftly.
34
END STATE
COST
Cost savings in infrastructure relative to our legacy environment.
Minimal upfront spending, pay-as-you-go pricing model.
Variety of machine types and size to choose from based on cost and
performance needs.
35
END STATE
FLEXIBILITY
Dynamic infrastructure provides great flexibility with faster
reprocessing and testing needs.
Simplified software procurement and license management.
Ease of exploratory runs without affecting production workload.
36
END STATE
SCALABILITY
Scale out at will easily on high volume days.
Cloud elasticity enables running multiple days at scale in parallel.
Reprocessing for historic days is completed in hours compared to
weeks in our legacy environment.
37
QUESTIONS?
Jaipaul Agonus
FINRA
Jaipaul.agonus@finra.org
linkedin.com/in/jaipaulagonus
38

Mais conteúdo relacionado

Mais procurados

Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRrICh morrow
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Amazon Web Services
 
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Amazon Web Services
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...Amazon Web Services
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...Amazon Web Services
 
AWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explainedAWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explainedHarsha KM
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOLAmazon Web Services
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Amazon Web Services
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduceAmazon Web Services
 
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)Amazon Web Services
 
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)Amazon Web Services
 

Mais procurados (20)

Hadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMRHadoop in the cloud with AWS' EMR
Hadoop in the cloud with AWS' EMR
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
 
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On...
 
AWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explainedAWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explained
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL(BDT210) Building Scalable Big Data Solutions: Intel & AOL
(BDT210) Building Scalable Big Data Solutions: Intel & AOL
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce
 
Beyond EC2 and S3
Beyond EC2 and S3Beyond EC2 and S3
Beyond EC2 and S3
 
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
AWS re:Invent 2016: How to Scale and Operate Elasticsearch on AWS (DEV307)
 
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
 

Destaque

AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)Amazon Web Services
 
Using CloudBees Jenkins Enterprise to Effectively Manage the Jenkins Ecosyste...
Using CloudBees Jenkins Enterprise to Effectively Manage the Jenkins Ecosyste...Using CloudBees Jenkins Enterprise to Effectively Manage the Jenkins Ecosyste...
Using CloudBees Jenkins Enterprise to Effectively Manage the Jenkins Ecosyste...FINRATech
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsIMC Institute
 
Optimizing costs with spot instances
Optimizing costs with spot instancesOptimizing costs with spot instances
Optimizing costs with spot instancesAmazon Web Services
 
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat...
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat...Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat...
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat...MapR Technologies
 
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...The Hive
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache HiveMurtaza Doctor
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsDataWorks Summit
 
Best Practices for Architecting a Pragmatic Web API.
Best Practices for Architecting a Pragmatic Web API.Best Practices for Architecting a Pragmatic Web API.
Best Practices for Architecting a Pragmatic Web API.Mario Cardinal
 
AWS SQS for better architecture
AWS SQS for better architectureAWS SQS for better architecture
AWS SQS for better architectureSaurabh Bangad
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveYifeng Jiang
 
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...Amazon Web Services
 
Kanban Board Examples
Kanban Board ExamplesKanban Board Examples
Kanban Board ExamplesShore Labs
 
Surprising failure factors when implementing eCommerce and Omnichannel eBusiness
Surprising failure factors when implementing eCommerce and Omnichannel eBusinessSurprising failure factors when implementing eCommerce and Omnichannel eBusiness
Surprising failure factors when implementing eCommerce and Omnichannel eBusinessDivante
 
Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Divante
 
Omnichannel Customer Experience
Omnichannel Customer ExperienceOmnichannel Customer Experience
Omnichannel Customer ExperienceDivante
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Big Data in Retail - Examples in Action
Big Data in Retail - Examples in ActionBig Data in Retail - Examples in Action
Big Data in Retail - Examples in ActionDavid Pittman
 

Destaque (18)

AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
AWS re:Invent 2016: Extending Hadoop and Spark to the AWS Cloud (GPST304)
 
Using CloudBees Jenkins Enterprise to Effectively Manage the Jenkins Ecosyste...
Using CloudBees Jenkins Enterprise to Effectively Manage the Jenkins Ecosyste...Using CloudBees Jenkins Enterprise to Effectively Manage the Jenkins Ecosyste...
Using CloudBees Jenkins Enterprise to Effectively Manage the Jenkins Ecosyste...
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
 
Optimizing costs with spot instances
Optimizing costs with spot instancesOptimizing costs with spot instances
Optimizing costs with spot instances
 
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat...
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat...Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat...
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat...
 
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO...
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Best Practices for Architecting a Pragmatic Web API.
Best Practices for Architecting a Pragmatic Web API.Best Practices for Architecting a Pragmatic Web API.
Best Practices for Architecting a Pragmatic Web API.
 
AWS SQS for better architecture
AWS SQS for better architectureAWS SQS for better architecture
AWS SQS for better architecture
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
 
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
(BDT403) Best Practices for Building Real-time Streaming Applications with Am...
 
Kanban Board Examples
Kanban Board ExamplesKanban Board Examples
Kanban Board Examples
 
Surprising failure factors when implementing eCommerce and Omnichannel eBusiness
Surprising failure factors when implementing eCommerce and Omnichannel eBusinessSurprising failure factors when implementing eCommerce and Omnichannel eBusiness
Surprising failure factors when implementing eCommerce and Omnichannel eBusiness
 
Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)
 
Omnichannel Customer Experience
Omnichannel Customer ExperienceOmnichannel Customer Experience
Omnichannel Customer Experience
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Big Data in Retail - Examples in Action
Big Data in Retail - Examples in ActionBig Data in Retail - Examples in Action
Big Data in Retail - Examples in Action
 

Semelhante a Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud

AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAmazon Web Services
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Amazon Web Services
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData WebinarSnappyData
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)Steve Min
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTurkish Testing Board
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopArchana Gopinath
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 
AWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Germany
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworksIJDKP
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...oj08
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveQubole
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoopgluent.
 

Semelhante a Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud (20)

AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloud
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
Real-world Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Inv...
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData Webinar
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
 
Gcp data engineer
Gcp data engineerGcp data engineer
Gcp data engineer
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
AWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data Analytics
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoop
 

Último

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Último (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud

  • 1. HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY Jaipaul Agonus FINRA Strata Hadoop World New York, Sep 2015
  • 2. FINRA - WHAT DO WE DO? Collect and Create • Up to 75 billion events per day • 13 National Exchanges, 5 Reporting Facilities • Reconstruct the market from trillions of events spanning years Detect & Investigate • Identify market manipulations, insider trading, fraud and compliance violations Enforce & Discipline • Ensure rule compliance • Fine and bar broker dealers • Refer matters to the SEC and other authorities 2
  • 3. FINRA’S SURVEILLANCE ALGORITHMS Hundreds of surveillance algorithms against massive amounts of data in multiple products (Equities, Options, etc.) and across multiple exchanges (NASDAQ, NYSE, CBOE, etc.). 3
  • 4. Best Execution Layering Compliance Detecting abusive activity and compliance breaches FINRA’S SURVEILLANCE ALGORITHMS 4
  • 5. FINRA’S SURVEILLANCE ALGORITHMS Dealing with big data before there was “Big Data”. Over 430 batch analytics in the Surveillance suite. “Massively Parallel Processing” methodology used to solve big-data problems in the legacy world. 5
  • 6. PRE-HADOOP DATA ARCHITECTURE  Tiered storage design that struggles to balance Cost, Performance and Flexibility 6
  • 7. PRE-HADOOP PAIN POINTS Data Silos Data distributed physically across MPP appliances, NAS and Tapes, affecting accessibility and efficiency. 7
  • 8. PRE-HADOOP PAIN POINTS Cost Expensive, specialized hardware tuned for CPU, storage and network performance. Proprietary software that comes with relative high cost & vendor lock-in. 8
  • 9. PRE-HADOOP PAIN POINTS Non-Elasticity Can't grow or shrink easily with data volume, bound by the cost of hardware in the appliance and the relative high cost of the software. 9
  • 10. ADDRESSING PAIN POINTS WITH.. HIVE De facto standard for SQL-on-Hadoop Amazon EMR Amazon Elastic Map Reduce – Managed Hadoop Framework Amazon S3 Amazon Simple Storage Service With Practically Infinite Storage 10
  • 11. WHY SQL? Heavily SQL based legacy application that works on source data that’s already available in structured format. Hundreds of thousands of lines of legacy SQL code, iterated and tested rigorously through the years and readily available for easy Hive porting. Developers, testers, data scientists and analysts with deep SQL skills and necessary business acumen already part of FINRA culture. 11
  • 12. HIVE Developed at Facebook and now open-source de facto SQL standard for Hadoop. Been around for seven years, battle tested at scale, and widely used across industries. Powerful abstraction over MapReduce jobs, translates HiveQL into map-reduce jobs that work against distributed dataset. 12
  • 13. HIVE – EXECUTION ENGINES MapReduce Mature batch-processing platform for the petabyte scale! Does not perform well enough for small data or iterative calculations with long data pipelines. Tez Aims to translate complex SQL statements into an optimized, purpose- built data processing graphs. Strikes a balance between performance, throughput, and scalability. Spark Fast in-memory computing, allows you to leverage available memory by fitting in all intermediate data. 13
  • 14. AMAZON(AWS) S3 Cost-effective and durable object storage for a variety of content. Allows separation of storage from compute resources, providing ability to scale each independently in pay-as-you-go pricing model. Meets Hadoop’s file system requirements and integrates well with EMR. 14
  • 15. EMR – ELASTIC MAP REDUCE Managed hadoop framework, easy to deploy and manage Hadoop clusters. Easy, fast, and cost-effective way to distribute and process vast amounts of data across dynamically scalable Amazon EC2 Instances. 15
  • 16. EMR INSTANCE TYPES A wide selection of virtual Linux servers with varying combinations of CPU, memory, storage, and networking capacity. Each instance type includes one or more instance sizes, allowing you to scale your resources to target workload. Available in Spot (your bid price) or On-demand model. [Source – Amazon] 16
  • 17. CLOUD BASED HADOOP ARCHITECTURE 17
  • 18. UNDERLYING DESIGN PATTERN [Source – Amazon] Cluster lives just for the duration of the job, shutdown when the job is done. Persist input and output data in S3. Run multiple jobs in multiple Amazon EMR clusters over the same data set (in S3) without overloading your HDFS nodes. Transient Clusters with S3 as HDFS 18
  • 19. UNDERLYING DESIGN PATTERN – BENEFITS [Source – Amazon] Control your cost, pay for what you use. Minimum maintenance, cluster goes away when the job is done. Persisting data in S3 enables easy reprocessing if the spot Instances are taken away due to an outbid. 19
  • 20. LESSON LEARNED Design your Hive batch analytics with focus on enabling direct data access and maximizing resource utilization. 20
  • 21. HIVE – DESIGNING FOR PERFORMANCE Enable Direct Data Access Partition data along natural query boundaries (e.g. trade date) and process only the data you need. Improve join performance by bucketing and sorting data ahead of time and reduce I/O scans during join process. Use broadcast joins when joining small tables. 21
  • 22. HIVE – DESIGNING FOR PERFORMANCE Tune Hive Configurations Increase parallelism by tuning MapReduce split size to use all the available map slots in the cluster. Increase the replication factor on dimension tables (not for additional resiliency but for better performance). Compress intermediate data and reduce data transfer between mappers/reducers. 22
  • 23. LESSON LEARNED Measure and profile your clusters from the outset and adjust continuously keeping pace with changing data size, processing flow and execution frameworks. 23
  • 24. CLUSTER PROFILING - SAMPLE Batch Process: • Order Audit Trail Report Card Validation Source Dataset: • 1 month of market orders data • Over 100 billion rows • Around 5 terabytes in size • Input stored in S3 Instance Choices: • m1.xlarge (General Purpose) - 4 CPUs, 15 GB RAM, 4 x 420 Disk Storage, High network performance • c3.8xlarge (Compute Optimized) - 32 CPUs, 60 GB RAM, 2 x 320 SSD Storage, High network performance Run Options: Attribute Option-1 Option-2 Option-3 Option-4 Instance Type m1.xl C3.8xl C3.8xl C3.8xl Instance Classification General Purpose Compute Optimized Compute Optimized Compute Optimized Cluster Size 100 16 32 100 24
  • 25. CLUSTER PROFILE – COMPARISON RESULTS 100 M1s Vs 16 C3s Vs 32 C3s Vs 100 C3s Attribute Option-1 Option-2 Option-3 Option-4 Instance Type m1.xl C3.8xl C3.8xl C3.8xl Instance Classification General Purpose Compute Optimized Compute Optimized Compute Optimized Cluster Size 100 16 32 100 Cost $57.33 $111.86 $127.05 $312.00 Time 3 Hrs 12 mins 8 Hrs 4 Hrs 56 mins 3 Hrs Approximate Spot Price per Instance $0.14 $0.78 $0.66 $0.78 Peak Disk Usage 0.04% 2.10% 1.20% 0.45% Peak Memory Usage 56% 43% 45% 41% Peak CPU Usage 100% 95% 98% 92% 25
  • 26. LESSON LEARNED Abstract execution framework (MR, Tez, Spark) from business logic in your app architecture. This allows switching frameworks, keeping up with Hive community. 26
  • 27. LESSON LEARNED Hive UDFs (User Defined Functions) can solve your complex problems when SQL falls short. 27
  • 28. HIVE UDFS Used when Hive functionality falls short • e.g. window function with ignore nulls – not supported in Hive • e.g. date formatting functions – Hive 1.2 has better support Used when non-procedural SQL can’t accomplish the task • e.g. de-dupe many-to-many time series pairings A->B 10AM, A->C 11AM, D->B 11AM, D->C 11AM, E->C 12PM 28
  • 29. LESSON LEARNED Choose an optimal storage format (row, columnar) and compression type (splittable, space efficient, fast). 29
  • 30. FILE STORAGE – FORMAT AND COMPRESSION [Source – Amazon] STORAGE FORMATS Columnar or Row based storage RCFILE, ORC, PARQUET – Columnar formats, skip loading unwanted columns! COMPRESSION TYPES Reduce the number of bytes written to/read from HDFS Some are fast but offer less space reduction, some are space efficient but slower, some are splittable and some are not. 30 Compression Extension Splittable Encoding/Decoding Speed (Scale 1-4) Space Savings % (scale 1-4) Gzip gz no 1 4 lzo lzo yes if indexed 2 2 bzip2 bz2 yes 3 3 snappy snappy no 4 1
  • 31. LESSON LEARNED Being at the bleeding edge of sql-on-hadoop has its risks, you need the right strategy to mitigate risks/challenges along the way. 31
  • 32. RISKS AND MITIGATION STRATEGIES Hive, released in late 2008, Tez/Spark backend added recently. Traditional platforms like Oracle tested and iterated through four decades. Discovered issues with Hadoop/Hive during migration that are related to compression, storage types and memory management. Performance issues with interactive analytics – Impacts Data Scientists & Business Analysts. 32
  • 33. RISKS AND MITIGATION STRATEGIES Extensive parallel run comparison (apples-to-apples) against legacy production data to identify issues before production roll out. Partnered with Hadoop and Cloud vendors to address Hadoop/Hive issues and feature requests quickly. Push of a button automated regression test suites to kick the tires and test functionality for any minor software increments. Analyzing Presto/Tez to solve interactive analytics problems. 33
  • 34. RISKS AND MITIGATION STRATEGIES Took advantage of cloud elasticity to complete parallel runs against production volume at scale to produce results swiftly. 34
  • 35. END STATE COST Cost savings in infrastructure relative to our legacy environment. Minimal upfront spending, pay-as-you-go pricing model. Variety of machine types and size to choose from based on cost and performance needs. 35
  • 36. END STATE FLEXIBILITY Dynamic infrastructure provides great flexibility with faster reprocessing and testing needs. Simplified software procurement and license management. Ease of exploratory runs without affecting production workload. 36
  • 37. END STATE SCALABILITY Scale out at will easily on high volume days. Cloud elasticity enables running multiple days at scale in parallel. Reprocessing for historic days is completed in hours compared to weeks in our legacy environment. 37