SlideShare uma empresa Scribd logo
1 de 29
Baixar para ler offline
Spark Pipelines in the Cloud
with Alluxio
Gene Pang,Alluxio, Inc.
Spark Summit EU - October 2017
About Me
•  Gene Pang
•  Software engineer @ Alluxio, Inc.
•  Alluxio open source PMC member
•  Ph.D. from AMPLab @ UC Berkeley
•  Worked at Google before UC Berkeley
•  Twitter: @unityxx
•  Github: @gpang
©2017 Alluxio, Inc.All Rights Reserved 2
Outline
©2017 Alluxio, Inc.All Rights Reserved 3
1
2
3
Alluxio Overview
Data Pipelines
Experiments
History of Alluxio
Started at UC Berkeley AMPLab In Summer 2012
•  Originally named as Tachyon
•  Rebranded to Alluxio in early 2016
Open Sourced in 2013
•  Apache License 2.0
•  Latest Release:Alluxio 1.6.0
©2017 Alluxio, Inc.All Rights Reserved 4
5©2017 Alluxio, Inc.All Rights Reserved
Alluxio: Unify Data at Memory Speed
Namespace Unification
Architecture Flexibility
IO Performance
Data Ecosystem with Alluxio
•  Apps only talk to
Alluxio
•  Simple Add/Remove
•  No App Changes
•  In-Memory
Performance
Native File System
Hadoop Compatible
File System
Native Key-Value
Interface
Fuse Compatible File
System
HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface
©2017 Alluxio, Inc.All Rights Reserved 6
Next Gen Analytics with Alluxio
Native File System
Hadoop Compatible
File System
Native Key-Value
Interface
Fuse Compatible File
System
HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface
Apps, Data & Storage
at Memory Speed
ü  Big Data/IoT
ü  AI/ML
ü  Deep Learning
ü  Cloud Migration
ü  Multi Platform
ü  Autonomous
©2017 Alluxio, Inc.All Rights Reserved 7
Fastest Growing Big Data
Open Source Projects
Fastest Growing open-
source project in the big
data ecosystem
Running in large
production clusters
Now: 600+ Contributors
from 100+ organizations
0
100
200
300
400
500
0 10 20 30 40 45
NumberofContributors
Github Open Source Contributors by Month
Alluxio
Spark
Kafka
Redis
HDFS
Cassandra
Hive
©2017 Alluxio, Inc.All Rights Reserved 8
Outline
©2017 Alluxio, Inc.All Rights Reserved 9
1
2
3
Alluxio Overview
Data Pipelines
Experiments
1 0©2017 Alluxio, Inc.All Rights Reserved
Data Processing Pipeline
Stage 1 Stage 2 Stage 3 Stage 4
Output of stage is input of next stage
1 1©2017 Alluxio, Inc.All Rights Reserved
Data Processing Pipeline
Sharing via common storage
Shared Storage
Stage 1 Stage 2 Stage 3 Stage 4
1 2©2017 Alluxio, Inc.All Rights Reserved
What about
pipelines in the cloud?
1 3©2017 Alluxio, Inc.All Rights Reserved
Data Processing Pipeline in the Cloud
S3 (object store)
Stage 1 Stage 2 Stage 3 Stage 4
1 4©2017 Alluxio, Inc.All Rights Reserved
Sharing data via cloud storage
slows down performance
1 5©2017 Alluxio, Inc.All Rights Reserved
Cloud Pipeline with Alluxio
Sharing via Alluxio memory
S3 (object store)
Stage 1 Stage 2 Stage 3 Stage 4
Sharing Data in the Cloud
Previous stage writes output to storage
Next stage reads input from storage
…
©2017 Alluxio, Inc.All Rights Reserved 1 6
Sharing Data in the Cloud with Alluxio
Previous stage writes output to storage memory
Next stage reads input from storage memory
…
©2017 Alluxio, Inc.All Rights Reserved 1 7
1 8©2017 Alluxio, Inc.All Rights Reserved
Alluxio enables
in-memory data sharing
Faster pipeline performance
Alluxio – Fast Durable Writes
Improves write performance,
without sacrificing fault tolerance
©2017 Alluxio, Inc.All Rights Reserved 1 9
Alluxio – Fast Durable Writes
Synchronously write to replicas in Alluxio memory
Asynchronously write to underlying storage
©2017 Alluxio, Inc.All Rights Reserved 2 0
Alluxio – Fast Durable Writes
©2017 Alluxio, Inc.All Rights Reserved 2 1
client
Storage
Write is completed
Alluxio – Fast Durable Writes
©2017 Alluxio, Inc.All Rights Reserved 2 2
client
Storage
Asynchronously
written to
storage
Outline
©2017 Alluxio, Inc.All Rights Reserved 2 3
1
2
3
Alluxio Overview
Data Pipelines
Experiments
Log Pipeline in Amazon Web Services
©2017 Alluxio, Inc.All Rights Reserved 2 4
Generate Parquet Transform Aggregate
Generate: [MapReduce] Create random csv log data
Parquet: [MapReduce] Convert csv to parquet format
Transform: [Spark] Update column values
Aggregate: [Spark] Compute group by / aggregate
Log Pipeline Environment
•  r4.2xlarge instances (61 GB ram, 8 CPUs)
•  1 master, 3 workers
•  Apache Spark 2.2.0
•  Apache Hadoop 2.7.2
•  Alluxio 1.6.0
•  Generate 12 GB of logs
•  Compare AWS S3 vs Alluxio w/ Fast Durable Writes
©2017 Alluxio, Inc.All Rights Reserved 2 5
2 6©2017 Alluxio, Inc.All Rights Reserved
Average Stage Completion Time
2 7©2017 Alluxio, Inc.All Rights Reserved
Pipeline Completion Time
Over 9x speedup!
Alluxio and Pipelines in the Cloud
Alluxio enables in-memory sharing for data
pipelines in the cloud
Alluxio’s Fast Durable Write feature increases
performance without sacrificing fault tolerance
©2017 Alluxio, Inc.All Rights Reserved 2 8
Thank you!
Gene Pang
gene@alluxio.com
Twitter: @unityxx
Twi$er.com/alluxio	
  
Linkedin.com/alluxio	
  
	
  
Website
www.alluxio.com
E-mail
info@alluxio.com
@
Social Media
á
™
©2017 Alluxio, Inc.All Rights Reserved 2 9

Mais conteúdo relacionado

Mais procurados

Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouFast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouSpark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit
 
Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...
Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...
Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...DataWorks Summit/Hadoop Summit
 
High Availability PostgreSQL on OpenShift...and more!
High Availability PostgreSQL on OpenShift...and more!High Availability PostgreSQL on OpenShift...and more!
High Availability PostgreSQL on OpenShift...and more!Jonathan Katz
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Spark Summit
 
Sparklyr: Recap, Updates, and Use Cases with Javier Luraschi
Sparklyr: Recap, Updates, and Use Cases with Javier LuraschiSparklyr: Recap, Updates, and Use Cases with Javier Luraschi
Sparklyr: Recap, Updates, and Use Cases with Javier LuraschiDatabricks
 
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkAPACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkSpark Summit
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Databricks
 
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...InfluxData
 
Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio
Improving Data Locality for Spark Jobs on Kubernetes Using AlluxioImproving Data Locality for Spark Jobs on Kubernetes Using Alluxio
Improving Data Locality for Spark Jobs on Kubernetes Using AlluxioAlluxio, Inc.
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Spark Summit
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.
 

Mais procurados (20)

Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouFast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg Schad
 
Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...
Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...
Open Source Ingredients for Interactive Data Analysis in Spark by Maxim Lukiy...
 
High Availability PostgreSQL on OpenShift...and more!
High Availability PostgreSQL on OpenShift...and more!High Availability PostgreSQL on OpenShift...and more!
High Availability PostgreSQL on OpenShift...and more!
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
 
Sparklyr: Recap, Updates, and Use Cases with Javier Luraschi
Sparklyr: Recap, Updates, and Use Cases with Javier LuraschiSparklyr: Recap, Updates, and Use Cases with Javier Luraschi
Sparklyr: Recap, Updates, and Use Cases with Javier Luraschi
 
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkAPACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk
 
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
Best Practices for Using Alluxio with Apache Spark with Cheng Chang and Haoyu...
 
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
 
Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio
Improving Data Locality for Spark Jobs on Kubernetes Using AlluxioImproving Data Locality for Spark Jobs on Kubernetes Using Alluxio
Improving Data Locality for Spark Jobs on Kubernetes Using Alluxio
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 

Semelhante a Spark Pipelines in the Cloud with Alluxio with Gene Pang

Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkAlluxio, Inc.
 
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanSpark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanData Con LA
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkAlluxio, Inc.
 
Accelerating Spark Workloads in a Mesos Environment with Alluxio
Accelerating Spark Workloads in a Mesos Environment with AlluxioAccelerating Spark Workloads in a Mesos Environment with Alluxio
Accelerating Spark Workloads in a Mesos Environment with AlluxioAlluxio, Inc.
 
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Alluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio, Inc.
 
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 Alluxio, Inc.
 
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio Ceph Community
 
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio, Inc.
 
Accelerating Spark with Kubernetes
Accelerating Spark with KubernetesAccelerating Spark with Kubernetes
Accelerating Spark with KubernetesAlluxio, Inc.
 
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems IntegrationJenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems IntegrationOleg Nenashev
 
Building Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioBuilding Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioAlluxio, Inc.
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio, Inc.
 
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017Alluxio, Inc.
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsAlluxio, Inc.
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudAlluxio, Inc.
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...Alluxio, Inc.
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Spark Summit
 

Semelhante a Spark Pipelines in the Cloud with Alluxio with Gene Pang (20)

Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Spark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin FanSpark Pipelines in the Cloud with Alluxio by Bin Fan
Spark Pipelines in the Cloud with Alluxio by Bin Fan
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Accelerating Spark Workloads in a Mesos Environment with Alluxio
Accelerating Spark Workloads in a Mesos Environment with AlluxioAccelerating Spark Workloads in a Mesos Environment with Alluxio
Accelerating Spark Workloads in a Mesos Environment with Alluxio
 
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
Unify Data at Memory Speed by Haoyuan Li - VAULT Conference 2017
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
 
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
 
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
Alluxio (Formerly Tachyon): Unify Data At Memory Speed at Global Big Data Con...
 
Accelerating Spark with Kubernetes
Accelerating Spark with KubernetesAccelerating Spark with Kubernetes
Accelerating Spark with Kubernetes
 
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems IntegrationJenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
Jenkins Pipeline @ Scale. Building Automation Frameworks for Systems Integration
 
Building Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioBuilding Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, Alluxio
 
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016Alluxio Use Cases at Strata+Hadoop World Beijing 2016
Alluxio Use Cases at Strata+Hadoop World Beijing 2016
 
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
Alluxio: Unify Data at Memory Speed at Strata and Hadoop World San Jose 2017
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any Cloud
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
 

Mais de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Spark Summit
 

Mais de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
 

Último

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 

Último (20)

Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 

Spark Pipelines in the Cloud with Alluxio with Gene Pang

  • 1. Spark Pipelines in the Cloud with Alluxio Gene Pang,Alluxio, Inc. Spark Summit EU - October 2017
  • 2. About Me •  Gene Pang •  Software engineer @ Alluxio, Inc. •  Alluxio open source PMC member •  Ph.D. from AMPLab @ UC Berkeley •  Worked at Google before UC Berkeley •  Twitter: @unityxx •  Github: @gpang ©2017 Alluxio, Inc.All Rights Reserved 2
  • 3. Outline ©2017 Alluxio, Inc.All Rights Reserved 3 1 2 3 Alluxio Overview Data Pipelines Experiments
  • 4. History of Alluxio Started at UC Berkeley AMPLab In Summer 2012 •  Originally named as Tachyon •  Rebranded to Alluxio in early 2016 Open Sourced in 2013 •  Apache License 2.0 •  Latest Release:Alluxio 1.6.0 ©2017 Alluxio, Inc.All Rights Reserved 4
  • 5. 5©2017 Alluxio, Inc.All Rights Reserved Alluxio: Unify Data at Memory Speed Namespace Unification Architecture Flexibility IO Performance
  • 6. Data Ecosystem with Alluxio •  Apps only talk to Alluxio •  Simple Add/Remove •  No App Changes •  In-Memory Performance Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface ©2017 Alluxio, Inc.All Rights Reserved 6
  • 7. Next Gen Analytics with Alluxio Native File System Hadoop Compatible File System Native Key-Value Interface Fuse Compatible File System HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface Apps, Data & Storage at Memory Speed ü  Big Data/IoT ü  AI/ML ü  Deep Learning ü  Cloud Migration ü  Multi Platform ü  Autonomous ©2017 Alluxio, Inc.All Rights Reserved 7
  • 8. Fastest Growing Big Data Open Source Projects Fastest Growing open- source project in the big data ecosystem Running in large production clusters Now: 600+ Contributors from 100+ organizations 0 100 200 300 400 500 0 10 20 30 40 45 NumberofContributors Github Open Source Contributors by Month Alluxio Spark Kafka Redis HDFS Cassandra Hive ©2017 Alluxio, Inc.All Rights Reserved 8
  • 9. Outline ©2017 Alluxio, Inc.All Rights Reserved 9 1 2 3 Alluxio Overview Data Pipelines Experiments
  • 10. 1 0©2017 Alluxio, Inc.All Rights Reserved Data Processing Pipeline Stage 1 Stage 2 Stage 3 Stage 4 Output of stage is input of next stage
  • 11. 1 1©2017 Alluxio, Inc.All Rights Reserved Data Processing Pipeline Sharing via common storage Shared Storage Stage 1 Stage 2 Stage 3 Stage 4
  • 12. 1 2©2017 Alluxio, Inc.All Rights Reserved What about pipelines in the cloud?
  • 13. 1 3©2017 Alluxio, Inc.All Rights Reserved Data Processing Pipeline in the Cloud S3 (object store) Stage 1 Stage 2 Stage 3 Stage 4
  • 14. 1 4©2017 Alluxio, Inc.All Rights Reserved Sharing data via cloud storage slows down performance
  • 15. 1 5©2017 Alluxio, Inc.All Rights Reserved Cloud Pipeline with Alluxio Sharing via Alluxio memory S3 (object store) Stage 1 Stage 2 Stage 3 Stage 4
  • 16. Sharing Data in the Cloud Previous stage writes output to storage Next stage reads input from storage … ©2017 Alluxio, Inc.All Rights Reserved 1 6
  • 17. Sharing Data in the Cloud with Alluxio Previous stage writes output to storage memory Next stage reads input from storage memory … ©2017 Alluxio, Inc.All Rights Reserved 1 7
  • 18. 1 8©2017 Alluxio, Inc.All Rights Reserved Alluxio enables in-memory data sharing Faster pipeline performance
  • 19. Alluxio – Fast Durable Writes Improves write performance, without sacrificing fault tolerance ©2017 Alluxio, Inc.All Rights Reserved 1 9
  • 20. Alluxio – Fast Durable Writes Synchronously write to replicas in Alluxio memory Asynchronously write to underlying storage ©2017 Alluxio, Inc.All Rights Reserved 2 0
  • 21. Alluxio – Fast Durable Writes ©2017 Alluxio, Inc.All Rights Reserved 2 1 client Storage Write is completed
  • 22. Alluxio – Fast Durable Writes ©2017 Alluxio, Inc.All Rights Reserved 2 2 client Storage Asynchronously written to storage
  • 23. Outline ©2017 Alluxio, Inc.All Rights Reserved 2 3 1 2 3 Alluxio Overview Data Pipelines Experiments
  • 24. Log Pipeline in Amazon Web Services ©2017 Alluxio, Inc.All Rights Reserved 2 4 Generate Parquet Transform Aggregate Generate: [MapReduce] Create random csv log data Parquet: [MapReduce] Convert csv to parquet format Transform: [Spark] Update column values Aggregate: [Spark] Compute group by / aggregate
  • 25. Log Pipeline Environment •  r4.2xlarge instances (61 GB ram, 8 CPUs) •  1 master, 3 workers •  Apache Spark 2.2.0 •  Apache Hadoop 2.7.2 •  Alluxio 1.6.0 •  Generate 12 GB of logs •  Compare AWS S3 vs Alluxio w/ Fast Durable Writes ©2017 Alluxio, Inc.All Rights Reserved 2 5
  • 26. 2 6©2017 Alluxio, Inc.All Rights Reserved Average Stage Completion Time
  • 27. 2 7©2017 Alluxio, Inc.All Rights Reserved Pipeline Completion Time Over 9x speedup!
  • 28. Alluxio and Pipelines in the Cloud Alluxio enables in-memory sharing for data pipelines in the cloud Alluxio’s Fast Durable Write feature increases performance without sacrificing fault tolerance ©2017 Alluxio, Inc.All Rights Reserved 2 8
  • 29. Thank you! Gene Pang gene@alluxio.com Twitter: @unityxx Twi$er.com/alluxio   Linkedin.com/alluxio     Website www.alluxio.com E-mail info@alluxio.com @ Social Media á ™ ©2017 Alluxio, Inc.All Rights Reserved 2 9