Eric Anderson
Product Manager
@ericmander
Rise of Intermediary APIs
(Beam and Alluxio)
https://goo.gl/Fa95XZ
Google Cloud Platform 2
About Me
Product Manager at Google on Cloud Dataflow
Work closely with most of the Apache Beam committers
Project Management Committee for Alluxio
Contributed Google Compute Engine support to Alluxio
Originally from Salt Lake City, UT
Father of 3 kids!
Twitter: @ericmander
Google Cloud Platform 3
Intermediary API?
Jesse Anderson (formerly Cloudera) in blog post: Strata+Hadoop Trends
I’m open to a better name if you have ideas
Google Cloud Platform 4
In the beginning...
There was only one approach to data processing
HDFS GFS
Hadoop MapReduce
Open Source Google
Google Cloud Platform 5
In the beginning...
And it required just two APIs, one for job description, one for storage
HDFS API GFS API
Hadoop API MapReduce API
Hadoop MR
HDFS GFS
Open Source Google
Google Cloud Platform 6
Then there was an evolution
But MapReduce was really hard (data processing in assembly language)
MapReduce API
MR
Google Cloud Platform 7
Flume (2010)
Flume was a programming model (API) innovation
(FlumeJava not Apache Flume)
MapReduce API
Flume
MR
Programming
Model
Higher level abstractions
- PCollections (RDDs)
- PTransforms
Directed Acyclic Graphs (DAGs)
Pipeline optimization (fusing)
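
A toy model can illustrate these ideas. The sketch below is not FlumeJava or Beam code: `ToyPCollection`, `map`, `plan`, and `run` are hypothetical names invented for this example. It records element-wise transforms as a deferred chain (a linear DAG) and fuses them into a single pass over the data when the pipeline is finally executed.

```python
from typing import Callable, Iterable, List, Optional


class ToyPCollection:
    """Hypothetical deferred collection: source data plus pending transforms."""

    def __init__(self, source: Iterable, stages: Optional[List[Callable]] = None):
        self._source = source
        self._stages = list(stages or [])

    def map(self, fn: Callable) -> "ToyPCollection":
        # Nothing runs here; the transform is only appended to the plan.
        return ToyPCollection(self._source, self._stages + [fn])

    def plan(self) -> List[str]:
        # Names of the planned stages, in execution order.
        return [fn.__name__ for fn in self._stages]

    def run(self) -> list:
        # "Fusion": apply every stage to each element in one pass,
        # instead of materializing an intermediate collection per stage.
        out = []
        for element in self._source:
            for fn in self._stages:
                element = fn(element)
            out.append(element)
        return out


def parse(line: str) -> int:
    return int(line.strip())


def double(n: int) -> int:
    return n * 2


numbers = ToyPCollection(["1", " 2", "3 "]).map(parse).map(double)
result = numbers.run()
```

Because the plan is built before anything executes, an optimizer has the whole graph available and can decide which stages to fuse. A real system would make that decision per stage; this sketch simply fuses everything.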
Google Cloud Platform 8
Millwheel (2013)
Millwheel was an execution model innovation
Introduced a new API, as expected
3 APIs, 2 processing systems!
MapReduce API Millwheel API
Flume
MR Millwheel
Execution Model
Low latency, exactly-once, stream processing
Google Cloud Platform 9
Programming model innovation: Batch and streaming unified
Execution model innovation: Managed batch and service
Dataflow (2015)
MapReduce API Millwheel API
Flume Dataflow SDK
Cloud Dataflow
MR Millwheel
Programming
Model
Execution Model
Google Cloud Platform 11
“We believe that [...] the Beam model is the future reference programming
model for writing data applications in both stream and batch”
- Kostas Tzoumas, CEO of data Artisans and Flink co-creator
Apache Beam (2016)
[Diagram labels: Apache Beam; Flink API, Spark API, Dataflow SDK; Flink, Spark, Cloud Dataflow, Local]
Google Cloud Platform 12
Apache Beam
1. The Beam Programming Model (unifies streaming/batch)
a. Transformations
b. Windowing
c. Watermarks + Triggers
d. Accumulation
2. SDKs for writing Beam pipelines
a. Java (Scala thanks to Spotify)
b. Python
3. Runners for existing distributed processing backends
a. Apache Flink (thanks to data Artisans)
b. Apache Spark (thanks to Cloudera and PayPal)
c. Google Cloud Dataflow (fully managed service from Google)
d. Local runner for testing
e. Other runners in progress: Apache Gearpump, Apache Apex
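
The split between one programming model and several runners can be sketched in plain Python. This is not the Beam SDK: `window_start`, `batch_runner`, and `streaming_runner` are hypothetical names, and watermarks, triggers, and accumulation modes are deliberately left out. The sketch shows only that the same fixed-window, per-key sum gives the same answer whether the input is processed as a bounded batch or one event at a time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[int, str, int]      # (event_time_seconds, key, value)
WindowedKey = Tuple[int, str]     # (window_start_seconds, key)


def window_start(event_time: int, size: int) -> int:
    # Fixed (tumbling) windows: each event belongs to exactly one window.
    return event_time - (event_time % size)


def batch_runner(events: List[Event], size: int) -> Dict[WindowedKey, int]:
    # Hypothetical "batch" backend: sees the whole bounded input at once.
    totals: Dict[WindowedKey, int] = defaultdict(int)
    for ts, key, value in sorted(events):
        totals[(window_start(ts, size), key)] += value
    return dict(totals)


def streaming_runner(events: Iterable[Event], size: int) -> Iterator[Dict[WindowedKey, int]]:
    # Hypothetical "streaming" backend: consumes events one at a time, in
    # arrival order, and emits the accumulated totals after each event.
    totals: Dict[WindowedKey, int] = defaultdict(int)
    for ts, key, value in events:
        totals[(window_start(ts, size), key)] += value
        yield dict(totals)


# Arrival order differs from event-time order (the event at t=5 arrives last).
events = [(12, "a", 1), (61, "a", 4), (30, "b", 2), (5, "a", 3)]

batch_result = batch_runner(events, size=60)
*_, streaming_result = streaming_runner(iter(events), size=60)
```

Windows are assigned by event time rather than arrival time, so the late event at t=5 still lands in the first window. A real streaming runner also has to decide when a window's result is complete enough to emit, which is the job of the watermarks and triggers listed on the slide.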
Google Cloud Platform 13
Once again, there is only one library we need for data processing, except this time:
- It’s easy/expressive
- And we can still choose from the best execution technology
Apache Beam (2017?)
[Diagram labels: Apache Beam; Flink API, Spark API, Dataflow SDK; Flink, Spark, Dataflow, Local, Gearpump, Apache Apex]
Google Cloud Platform 14
Coming full circle
Once again, there is only one library we need for data processing, except this time:
- It’s easy/expressive
- And we can still choose from the best execution technology
Yet, we’ve tried this before...
Hadoop API MapReduce API
Hadoop MR
Google Cloud Platform 15
Apache Crunch (2012)
Apache Crunch is an open source Flume-like API on Hadoop and now Spark.
MapReduce API
Crunch
Hadoop
Programming
Model
MapReduce
Flume
Google Cloud Platform 16
Apache Crunch (2012)
Interest in Apache Crunch vs Apache Beam
Why? Perhaps...
● Limited portability need / value
● Missed the streaming revolution
● Community support
Google Cloud Platform 17
What about storage?
And it required just two APIs, one for job description, one for storage
HDFS API GFS API
Hadoop API MapReduce API
Hadoop MR
HDFS GFS
Open Source Google
Google Cloud Platform 18
Need for Intermediary Storage API
Again, an explosion of options
No reason to believe this will ever end.
There will always be innovation on storage and the file system
[Diagram labels: HDFS API, Swift API, GCS / S3 APIs, GlusterFS API; HDFS, Swift, GCS / S3, GlusterFS]
Google Cloud Platform 19
Model for expressing storage lifecycle
There are patterns we want to express:
● Caching
● Retention policy
● ACLs
● Down-tiering old or stale data
Across storage systems:
● Unified namespace
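
One way to picture these patterns is a mount table plus a per-mount policy. The sketch below is not Alluxio's API: `UnifiedNamespace`, `LifecyclePolicy`, their fields, and the example URIs are all hypothetical. It resolves logical paths to underlying storage by longest-prefix match and turns a file's age into a lifecycle action.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LifecyclePolicy:
    """Hypothetical per-mount policy; field names are illustrative only."""
    hot_days: int        # data younger than this stays in the fast tier
    retention_days: int  # data at or beyond this age is eligible for deletion

    def action_for(self, age_days: int) -> str:
        if age_days >= self.retention_days:
            return "expire"
        if age_days >= self.hot_days:
            return "down-tier"
        return "keep-hot"


class UnifiedNamespace:
    """Maps logical paths onto several underlying storage systems."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, logical_prefix: str, storage_uri: str) -> None:
        self._mounts[logical_prefix.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, logical_path: str) -> str:
        # Longest-prefix match, so nested mounts override their parents.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if logical_path == prefix or logical_path.startswith(prefix + "/"):
                return self._mounts[prefix] + logical_path[len(prefix):]
        raise KeyError(f"no mount covers {logical_path}")


ns = UnifiedNamespace()
ns.mount("/data", "hdfs://namenode:8020/warehouse")       # example URIs, not real hosts
ns.mount("/data/archive", "s3://example-bucket/archive")

policy = LifecyclePolicy(hot_days=7, retention_days=365)
```

Callers only ever see paths under `/data`. Which storage system holds the bytes, and how long they stay in a fast tier, becomes configuration rather than application code.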
Google Cloud Platform 20
PRD: Intermediate Storage API
1. Model for expressing storage lifecycle
2. Write to the popular storage systems
3. Pluggable APIs extend to other systems
4. Read from the popular processing frameworks
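
Requirements 2 and 3 amount to a plug-in interface underneath a single caller-facing API. The sketch below is a minimal illustration, assuming a made-up `UnderStorage` interface and an in-process cache; none of these names come from Alluxio or any other real system.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class UnderStorage(ABC):
    """Hypothetical plug-in interface a storage system would implement."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class DictStorage(UnderStorage):
    """Stand-in backend that keeps bytes in a dict and counts reads for the demo."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.reads = 0

    def read(self, path: str) -> bytes:
        self.reads += 1
        return self.blobs[path]

    def write(self, path: str, data: bytes) -> None:
        self.blobs[path] = data


class IntermediaryStore:
    """One API for callers; backends are selected by URI scheme."""

    def __init__(self) -> None:
        self._backends: Dict[str, UnderStorage] = {}
        self._cache: Dict[str, bytes] = {}

    def register(self, scheme: str, backend: UnderStorage) -> None:
        self._backends[scheme] = backend

    def _split(self, uri: str) -> Tuple[UnderStorage, str]:
        scheme, _, path = uri.partition("://")
        return self._backends[scheme], path

    def write(self, uri: str, data: bytes) -> None:
        backend, path = self._split(uri)
        backend.write(path, data)      # write-through to the backing system
        self._cache[uri] = data

    def read(self, uri: str) -> bytes:
        if uri not in self._cache:     # read-through on a cache miss
            backend, path = self._split(uri)
            self._cache[uri] = backend.read(path)
        return self._cache[uri]


store = IntermediaryStore()
cold = DictStorage()
store.register("cold", cold)
cold.write("reports/q1.csv", b"a,b\n1,2\n")   # data that already lives in the backend

first = store.read("cold://reports/q1.csv")
second = store.read("cold://reports/q1.csv")
```

The second read is served from the cache, so the backend is touched only once. Supporting another storage system means writing one more `UnderStorage` subclass, with no change to callers.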
Google Cloud Platform 21
Pluggable under storage
Unified namespace + Tiered storage + Lineage
Supports at least a half dozen
Supports at least a half dozen
Alluxio
1. Model for expressing storage lifecycle
2. Write to the popular storage systems
3. Pluggable APIs extend to other systems
4. Read from the popular processing frameworks
[Diagram labels: Hadoop, Flink, Spark, Local, HBase, Presto; Alluxio; Alibaba OSS, Swift, HDFS, GCS / S3, GlusterFS, NFS]
Google Cloud Platform 22
Survival tests
Survival tests            | Apache Beam                                          | Alluxio
Portability need / value  | Lots of frameworks with varying performance profiles | Lots of frameworks and storage systems with varying performance profiles
Catch the technology wave | Leading stream processing revolution                 | Leading in-memory revolution
Community support         | Top names in data processing                         | Fastest growing contributor base
Google Cloud Platform 23
My particular excitement about Alluxio
It’s a particularly interesting intermediary API because:
● Data has gravity. Alluxio allows enterprises to adopt new tech alongside
legacy storage.
● Alluxio’s unification of sources is valuable within a single job. Beam is
used with one framework at a time, so it’s portable across jobs.
● Alluxio has standalone value from its built-in open source in-memory
filesystem. Beam requires an underlying execution engine like Dataflow.
Google Cloud Platform 24
Intermediary APIs = Data processing nirvana
Coder:
● Ability to express my data processing job or storage lifecycle logically, independent of
physical constraints.
Deployer:
● Code portability
● Swap in technology at will
System/Technology Creators:
● Easy path to adoption
● Focus on features and performance, not APIs/connectivity
Google Cloud Platform 25
Apache Beam
Alluxio
Stack of the future?
Dataflow, Flink, Spark, Local, Gearpump, Apache Apex
Alibaba OSS, Swift, HDFS, GCS / S3, GlusterFS, NFS
Google Cloud Platform 26
Questions?
https://goo.gl/Fa95XZ

Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016

  • 1. Eric Anderson Product Manager @ericmander Rise of Intermediary APIs (Beam and Alluxio) https://goo.gl/Fa95XZ
  • 2. Google Cloud Platform 2 About Me Product Manager at Google on Cloud Dataflow Work closely with the most of the Apache Beam committers Project Management Committee for Alluxio Contributed Google Compute Engine support to Alluxio Originally from Salt Lake City, UT Father of 3 kids! Twitter: @ericmander
  • 3. Google Cloud Platform 3 Intermediary API? Jesse Anderson (formerly Cloudera) in blog post: Strata+Hadoop Trends I’m open to a better name if you have ideas
  • 4. Google Cloud Platform 4 In the beginning... There was only one approach to data processing HDFS GFS Hadoop MapReduce Open Source Google
  • 5. Google Cloud Platform 5 In the beginning... And it required just two APIs, one for job description, one for storage HDFS API GFS API Hadoop API MapReduce API Hadoop MR HDFS GFS Open Source Google
  • 6. Google Cloud Platform 6 Then there was an evolution But MapReduce was really hard (data processing in assembly language) MapReduce API MR
  • 7. Google Cloud Platform 7 Flume (2010) Flume was a programming model (API) innovation (FlumeJava not Apache Flume) MapReduce API Flume MR Programming Model Higher level abstractions - PCollections (RDDs) - PTransforms Directed Acyclical Graphs (DAGs) Pipeline optimization (fusing)
  • 8. Google Cloud Platform 8 Millwheel (2013) Millwheel was an execution model innovation Introduced a new API, as expected 3 APIs, 2 processing systems! MapReduce API Millwheel API Flume MR Millwheel Execution Model Low latency, exactly-once, stream processing
  • 9. Dataflow (2015). Programming model innovation: Batch and streaming unified. Execution model innovation: Managed batch and service. Diagram labels: Flume, Dataflow SDK (Programming Model), MapReduce API, MillWheel API, MR, MillWheel, Cloud Dataflow (Execution Model).
  • 10. Dataflow (2015). Programming model innovation: Batch and streaming unified. Execution model innovation: Managed batch and service. Diagram labels: Flume, Dataflow SDK (Programming Model), MapReduce API, MillWheel API, MR, MillWheel, Cloud Dataflow (Execution Model).
  • 11. Apache Beam (2016). “We believe that [...] the Beam model is the future reference programming model for writing data applications in both stream and batch” - Kostas Tzoumas, CEO of data Artisans and Flink co-creator. Diagram labels: Apache Beam; Dataflow SDK, Flink API, Spark API; Cloud Dataflow, Flink, Spark, Local.
  • 12. Apache Beam 1. The Beam Programming Model (unifies streaming/batch) a. Transformations b. Windowing c. Watermarks + Triggers d. Accumulation 2. SDKs for writing Beam pipelines a. Java (Scala thanks to Spotify) b. Python 3. Runners for existing distributed processing backends a. Apache Flink (thanks to data Artisans) b. Apache Spark (thanks to Cloudera and PayPal) c. Google Cloud Dataflow (fully managed service from Google) d. Local runner for testing e. Other runners in progress: Gear Pump, Apache Apex
  • 13. Apache Beam (2017?). There is, once again, only one library we need for data processing, except this time: it’s easy/expressive, and we can still choose from the best execution technology. Diagram labels: Apache Beam; Dataflow SDK, Flink API, Spark API; Dataflow, Flink, Spark, Local, Gear Pump, Apache Apex.
  • 14. Coming full circle. There is, once again, only one library we need for data processing, except this time: it’s easy/expressive, and we can still choose from the best execution technology. Yet, we’ve tried this before... Diagram labels: Hadoop API, MapReduce API, Hadoop, MR.
  • 15. Apache Crunch (2012). Apache Crunch is an open source Flume-like API on Hadoop and now Spark. Diagram labels: Crunch, Flume (Programming Model), MapReduce API, Hadoop, MapReduce.
  • 16. Apache Crunch (2012). Interest in Apache Crunch vs Apache Beam. Why? Perhaps... ● Limited portability need / value ● Missed the streaming revolution ● Community support
  • 17. What about storage? And it required just two APIs, one for job description, one for storage. Diagram labels: Open Source: Hadoop API, HDFS API, Hadoop, HDFS. Google: MapReduce API, GFS API, MR, GFS.
  • 18. Need for Intermediary Storage API. Again, an explosion of options. No reason to believe this will ever end. There will always be innovation on storage and the file system. Diagram labels: HDFS API, Swift API, GCS / S3 APIs, GlusterFS API; HDFS, Swift, GCS / S3, GlusterFS.
  • 19. Model for expressing storage lifecycle. There are patterns we want to express: ● Caching ● Retention policy ● ACLs ● Down-tiering old or stale data. Across storage systems: ● Unified namespace
  • 20. PRD: Intermediate Storage API 1. Model for expressing storage lifecycle 2. Write to the popular storage systems 3. Pluggable APIs extend to other systems 4. Read from the popular processing frameworks
  • 21. Alluxio 1. Model for expressing storage lifecycle (Unified namespace + Tiered storage + Lineage) 2. Write to the popular storage systems (Supports at least a half dozen) 3. Pluggable APIs extend to other systems (Pluggable under storage) 4. Read from the popular processing frameworks (Supports at least a half dozen). Diagram labels: Hadoop, Flink, Spark, Local, HBase, Presto; Alluxio; Alibaba OSS, Swift, HDFS, GCS / S3, GlusterFS, NFS.
  • 22. Survival tests (Apache Beam vs. Alluxio). Portability need / value: Apache Beam: Lots of frameworks with varying performance profiles; Alluxio: Lots of frameworks and storage systems with varying performance profiles. Catch the technology wave: Apache Beam: Leading stream processing revolution; Alluxio: Leading in-memory revolution. Community support: Apache Beam: Top names in data processing; Alluxio: Fastest growing contributor base.
  • 23. My particular excitement about Alluxio. It’s a particularly interesting intermediary API because: ● Data has gravity; Alluxio allows enterprises to adopt tech alongside legacy storage. ● Alluxio’s unification of sources is valuable within a single job. Beam is used with one framework at a time, so it’s portable across jobs. ● Alluxio has standalone value from its built-in open source in-memory filesystem. Beam requires an underlying execution engine like Dataflow.
  • 24. Intermediary APIs = Data processing nirvana. Coder: ● Ability to express my data processing job or storage lifecycle logically, independent of physical constraints. Deployer: ● Code portability ● Swap in technology at will. System/Technology Creators: ● Easy path to adoption ● Focus on features and performance, not APIs/connectivity
  • 25. Stack of the future? Diagram labels: Apache Beam over Dataflow, Flink, Spark, Local, Gear Pump, Apache Apex; Alluxio over Alibaba OSS, Swift, HDFS, GCS / S3, GlusterFS, NFS.
  • 26. Questions? https://goo.gl/Fa95XZ
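
The intermediary-API idea in the slides can be sketched in plain Python: one logical job description runs unchanged on interchangeable execution engines, and reads through one namespace that fronts several storage systems. Everything here (`Pipeline`, `EagerRunner`, `FusedRunner`, `UnifiedNamespace`, the mount prefixes, and the sample data) is a hypothetical stand-in invented for illustration. It is not the Apache Beam or Alluxio API, and real runners and under-stores are distributed systems rather than in-memory lists and dicts.

```python
# Hypothetical names throughout; not the Apache Beam or Alluxio API.

class Pipeline:
    """A logical job description: an ordered list of element-wise steps."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, pred):
        self.steps.append(("filter", pred))
        return self


class EagerRunner:
    """Executes one step at a time, materializing a full list after each."""

    def run(self, pipeline, source):
        data = list(source)
        for kind, fn in pipeline.steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


class FusedRunner:
    """Fuses all steps into a single pass over each element."""

    def run(self, pipeline, source):
        out = []
        for x in source:
            keep = True
            for kind, fn in pipeline.steps:
                if kind == "map":
                    x = fn(x)
                elif not fn(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out


class UnifiedNamespace:
    """Routes paths to pluggable under-stores (plain dicts here) by mount prefix."""

    def __init__(self):
        self.mounts = {}

    def mount(self, prefix, store):
        self.mounts[prefix] = store

    def read(self, path):
        # The longest matching mount prefix wins.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if path.startswith(prefix):
                return self.mounts[prefix][path[len(prefix):]]
        raise FileNotFoundError(path)


# Two "storage systems" behind one namespace.
ns = UnifiedNamespace()
ns.mount("/hdfs/", {"events": [1, 2, 3, 4]})
ns.mount("/s3/", {"events": [5, 6]})

# One job description, two execution engines.
job = Pipeline().map(lambda x: x * 10).filter(lambda x: x > 20)
records = ns.read("/hdfs/events") + ns.read("/s3/events")
results = [runner.run(job, records) for runner in (EagerRunner(), FusedRunner())]
print(results)
```

The job never names a runner or a storage backend, so swapping either one changes only the last few lines. That is the portability the "Deployer" slide describes.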