Alluxio Webinar
April 6, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics workloads such as Hive, Spark, Presto and machine learning experience sluggish response times because data and compute sit in multiple locations. There is also an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it addresses these performance and data management challenges.
We'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
3. Data Lakes and Silos Abound
Enterprises have organically created a legacy of data silos through short-term projects, mergers and acquisitions.
▪ Data lakes and critical data are often in a silo and challenging to access
▪ Consolidation of data lakes and silos is expensive and slow to complete
▪ Compute is everywhere
[Diagram labels: Teradata, POSIX file, internal apps, public clouds, S3 object, HDFS 1, HDFS 2]
4. 4 Big Trends Driving the Need for a New Architecture
▪ Separation of compute & storage
▪ Hybrid and multi-cloud environments
▪ Self-service data across the enterprise
▪ Rise of the object store
5. Market Summary
▪ Data volume, velocity and variety are avalanching: data doubles every two years*
▪ The business knows that data analytics/ML models allow them to compete effectively*
▪ Object is becoming the new data lake
▪ The enterprise is a multi-site, multi-cloud world and will remain so for some time
▪ Technical leadership wants the agility to run applications anywhere
▪ IT wants to offer a cloud-like experience to their users
▪ Technical organizations struggle to keep up with data ingest and business demands
* “The Fourth Industrial Revolution”, by Klaus Schwab
6. Alluxio’s Vision
"Orchestrate data for analytics and machine learning to enable
companies to grow and be agile regardless of where their data
and compute are located."
Quick-start cloud adoption that optimizes cost and yields 2X to 6X analytics acceleration for:
● Fraud protection
● Research for treatments for diseases like COVID-19
● Uptime for all industrial and digital technologies we depend on
7. What is Data Orchestration?
A platform that brings your data closer to compute across
clusters, regions and clouds.
8. Alluxio
Companies use Alluxio to:
• Get research, analytics and ML results that matter to the business 2X to 6X faster, using Alluxio’s advanced caching technology for multi-site and hybrid cloud
• Enable agility: use different compute or storage with no programming, through API translation (Hadoop to cloud or on-prem S3)
• Dramatically lower OpEx by eliminating data management and egress costs, through Alluxio’s unified namespace, API translation and policy-driven data movement
• Drop into existing on-prem and cloud environments with zero programming
9. Companies Using Alluxio
[Logo slide; sector labels: Internet, Public Cloud Providers, General, E-Commerce, Technology, Financial Services, Telco & Media, Others]
11. Alluxio Common Use Cases
▪ Burst big data workloads in hybrid cloud environments (Alluxio in the same instance / container as compute)
▪ Accelerate big data frameworks on the public cloud (Alluxio in the same instance / container as compute)
▪ Dramatically speed up big data on object stores on premises (Alluxio in the same container / machine as compute)
[Diagram: in each deployment, Alluxio sits alongside Presto, Hive, Spark or TensorFlow]
12. Our Solution: “Zero-Copy Burst”
Problem: HDFS cluster is compute-bound & complex to maintain
Barrier 1: Prohibitive network latency and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Offload on-prem cluster (both compute & I/O)
• Manage working set, not FULL set of data
• Local performance
• Automatic synchronization with on-prem changes
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of a hard switchover, migrate at your own pace
• Moves the data per policy, e.g. last 7 days
[Diagram: Spark, Presto, Hive and TensorFlow run on the Alluxio Data Orchestration and Control Service both on Google Cloud Platform (with GCS) and in the on-premises datacenter, linked by connectivity]
14. Alluxio at Walmart
Architectural Components
● 2x Performance: for range queries
● High Concurrency: with Alluxio
● Cost Reduction: half the compute costs, or 2x compute capacity for the same environment
● Auto-Scaling: to maintain a minimum number of Alluxio workers
15. Alluxio at Adobe
Cross Data Center Access
PROBLEM
Primary DC with a large Hadoop cluster is out of space; ad hoc SQL workloads are growing exponentially as analyst headcount has reached 1,800 people.
SOLUTION
Leverage compute resources outside of the primary on-prem DC for multiple analytical frameworks.
REMOTE DATA RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact and improve response times
● Self-service for end-users
16. Alluxio at Electronic Arts (EA)
Single Cloud with AWS
● Up to 6x Performance: when handling a large number of small files
● Elastic Compute: to reduce infrastructure costs
● Reduce S3 Costs: by eliminating S3 access operations
17. Machine Learning - Alibaba
● 97% of the theoretical upper limit of training performance
● 30,000 images/second with Alluxio, compared with 13,478 images/second with SSD
● 41% cost savings
19. Alluxio - Key Innovations
▪ Unified Namespace: bring all files and objects into a single interface
▪ API Translation: interact with data using any API
▪ Intelligent Caching and Multi-tiering: accelerate & tier data transparently
20. Data Accessibility (via popular APIs and API Translation)
Convert from Client-side Interface to native Storage Interface
Client-side interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
Storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
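To make the idea of converting a client-side call into a native storage call concrete, here is a minimal Python sketch. It is not Alluxio's implementation: the `UnderStorageDriver`, `HdfsDriver` and `S3Driver` classes, the `read` function and the example URIs are all hypothetical names chosen for illustration. It only shows the general pattern of choosing a storage driver from the URI scheme.

```python
from urllib.parse import urlparse


class UnderStorageDriver:
    """Hypothetical driver interface (illustration only, not an Alluxio class)."""
    scheme = ""

    def read(self, path: str) -> str:
        raise NotImplementedError


class HdfsDriver(UnderStorageDriver):
    scheme = "hdfs"

    def read(self, path: str) -> str:
        # A real driver would call the HDFS client here.
        return f"HDFS read of {path}"


class S3Driver(UnderStorageDriver):
    scheme = "s3"

    def read(self, path: str) -> str:
        # A real driver would issue an S3 GET here.
        return f"S3 GET of {path}"


# One driver instance per URI scheme.
DRIVERS = {cls.scheme: cls() for cls in (HdfsDriver, S3Driver)}


def read(uri: str) -> str:
    """Dispatch a single client-side read to the driver matching the URI scheme."""
    parsed = urlparse(uri)
    driver = DRIVERS.get(parsed.scheme)
    if driver is None:
        raise ValueError(f"no driver for scheme {parsed.scheme!r}")
    return driver.read(parsed.netloc + parsed.path)


print(read("hdfs://namenode:8020/data/a.csv"))
print(read("s3://example-bucket/data/a.csv"))
```

The caller uses one `read` function regardless of where the data lives, which is the property the slide attributes to API translation.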
21. Data Locality with Intelligent Multi-tiering
Local Performance from remote data using multi-tier storage
Tiers: Hot (RAM), Warm (SSD), Cold (HDD)
▪ Read & Write Buffering
▪ Transparent to App
▪ Policies for pinning, promotion/demotion, TTL
[Diagram labels: On-premises, Public Cloud]
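The promotion, demotion and pinning policies named on this slide can be illustrated with a toy model. The Python sketch below is an assumption-laden illustration, not Alluxio's algorithm: the `TieredCache` class, its tiny capacities and the least-recently-used demotion rule are all hypothetical. Tier 0 stands for the hot tier (e.g. RAM) and the last tier for the cold tier (e.g. HDD). TTL is left out to keep the sketch short.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache: tier 0 is hottest, the last tier is coldest."""

    def __init__(self, capacities):
        self.capacities = capacities                      # max entries per tier
        self.tiers = [OrderedDict() for _ in capacities]  # LRU order per tier
        self.pinned = set()                               # keys never demoted

    def pin(self, key):
        self.pinned.add(key)

    def tier_of(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None

    def put(self, key, value, level=0):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: evicted from the cache
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > self.capacities[level]:
            # Demote the least recently used unpinned entry one tier down.
            victim = next((k for k in tier if k not in self.pinned), None)
            if victim is None:
                break  # everything here is pinned; let the tier overflow
            self.put(victim, tier.pop(victim), level + 1)

    def get(self, key):
        level = self.tier_of(key)
        if level is None:
            return None  # miss: a real system would fetch from remote storage
        value = self.tiers[level].pop(key)
        self.put(key, value, 0)  # promote on access
        return value


cache = TieredCache([1, 1, 1])
for name in ("a", "b", "c"):
    cache.put(name, name.upper())
print({k: cache.tier_of(k) for k in "abc"})
cache.get("a")
print({k: cache.tier_of(k) for k in "abc"})
```

In this model an access promotes the entry to the hot tier and pushes colder entries down, while a pinned key stays in place.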
22. Unified Namespace
Migrate Data to Cloud Storage based on Access Policies
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
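As a rough illustration of a single path backed by multiple storage systems, the following Python sketch models a mount table that maps namespace prefixes to under-storage locations. The `MountTable` class, the `/data/...` paths, the namenode address and the bucket name are hypothetical and chosen only for this example. The sketch does not reproduce Alluxio's own mount mechanism.

```python
class MountTable:
    """Toy unified namespace: maps path prefixes to under-storage URIs."""

    def __init__(self):
        self.mounts = {}  # namespace prefix -> under-storage URI prefix

    def mount(self, prefix, ufs_uri):
        self.mounts[prefix.rstrip("/")] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path into the URI of the store that backs it."""
        best = None
        for prefix in self.mounts:
            if path == prefix or path.startswith(prefix + "/"):
                # The longest matching prefix wins, so nested mounts work.
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no mount covers {path}")
        return self.mounts[best] + path[len(best):]


table = MountTable()
table.mount("/data/Reports", "hdfs://namenode:8020/directory/Reports")
table.mount("/data/Sales", "s3://example-bucket/Sales")

print(table.resolve("/data/Reports/2021/q1.csv"))
print(table.resolve("/data/Sales/2021/q1.parquet"))
```

Applications see only `/data/...` paths, so the backing store for a directory can change without changing the paths they use.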
23. Policy Driven Data Migration
Migrate Data to Cloud Storage based on Access Policies
hdfs://host:port/directory/
Reports Sales
• Single Alluxio path backed by multiple storage systems
• Example policy: Migrate data older than 7 days from HDFS to S3
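The example policy above ("older than 7 days from HDFS to S3") amounts to a filter on file age and current location. The Python sketch below shows that selection step only. The `FileInfo` record and the `select_for_migration` function are hypothetical names, and using last-modified time as the age criterion is an assumption, since the slide does not say which timestamp the policy uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    path: str
    modified: datetime  # assumed age criterion: last modification time
    store: str          # "hdfs" or "s3"


def select_for_migration(files, now, max_age=timedelta(days=7)):
    """Return files still on HDFS whose last modification is older than max_age."""
    cutoff = now - max_age
    return [f for f in files if f.store == "hdfs" and f.modified < cutoff]


now = datetime(2021, 4, 6)
files = [
    FileInfo("/Reports/march.csv", datetime(2021, 3, 1), "hdfs"),
    FileInfo("/Reports/this_week.csv", datetime(2021, 4, 3), "hdfs"),
    FileInfo("/Sales/archive.csv", datetime(2021, 3, 1), "s3"),
]
for f in select_for_migration(files, now):
    print("migrate to S3:", f.path)
```

Only `/Reports/march.csv` is selected here: the second file is too recent and the third is already on S3.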
25. Alluxio Catalog Service
Functionality
▪ Manages metadata for structured data
▪ Abstracts other database catalogs as Under Database (UDB)
Benefits
▪ Schema-aware optimizations
▪ Simple deployment
[Diagram labels: Hive Metastore, Hive Under Database]
26. Transformation Service
Transform data to be compute-optimized independent of the storage format
▪ Coalesce
▪ Format Conversion (csv to parquet)
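Coalescing is read here as merging many small files into fewer, larger ones. Under that assumption, the planning step can be sketched in Python as a greedy grouping by size. The `plan_coalesce` function, the file names and the byte sizes are hypothetical, and the sketch says nothing about how Alluxio actually sizes its output files. Format conversion is not covered.

```python
def plan_coalesce(files, target_bytes):
    """Greedily group (name, size) pairs into batches of at most target_bytes.

    A single file larger than the target gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [("part-0.csv", 40), ("part-1.csv", 40), ("part-2.csv", 40),
               ("part-3.csv", 200), ("part-4.csv", 10)]
for batch in plan_coalesce(small_files, target_bytes=100):
    print(batch)
```

Each batch would then be written out as one larger file, which reduces the number of objects a query engine has to open.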
27. Example Results
▪ Attached existing Hive database into Alluxio Catalog
▪ Alluxio Catalog served table metadata for Presto
▪ Transformed store_sales by coalescing and converting CSV to Parquet
Query time:
Presto without Alluxio: 20s
Alluxio Transformations: 7s
Alluxio Transformations with Caching: 3s
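The three timings on this slide (20s, 7s, 3s) can be turned into speedup factors relative to Presto without Alluxio. The short Python calculation below uses only those three numbers; the variable names are arbitrary.

```python
# Query times in seconds, as reported on the slide.
timings_s = {
    "Presto without Alluxio": 20,
    "Alluxio Transformations": 7,
    "Alluxio Transformations with Caching": 3,
}

baseline = timings_s["Presto without Alluxio"]

# Speedup = baseline time / time with the given configuration.
speedups = {name: baseline / t for name, t in timings_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# Presto without Alluxio: 1.0x
# Alluxio Transformations: 2.9x
# Alluxio Transformations with Caching: 6.7x
```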
33. How can Alluxio help you?
• Did you learn what Alluxio Data Orchestration is?
• Do you have a use case Alluxio can accelerate?
For follow-up questions and to discuss your situation, please contact Peter at peter@alluxio.com