Slide 1. © Cloudera, Inc. All rights reserved.
Apache Spark: Usage and Roadmap in Hadoop
Jai Ranganathan
Slide 2
Spark will replace MapReduce
To become the standard execution engine for Hadoop
Slide 3
The Future of Data Processing on Hadoop
Spark complemented by specialized fit-for-purpose engines
• General Data Processing w/Spark: Fast Batch Processing, Machine Learning, and Stream Processing
• Analytic Database w/Impala: Low-Latency, Massively Concurrent Queries
• Full-Text Search w/Solr: Querying textual data
• On-Disk Processing w/MapReduce: Jobs at extreme scale and extremely disk IO intensive
Shared:
• Data Storage
• Metadata
• Resource Management
• Administration
• Security
• Governance
Slide 4
Cloudera Leading the Spark Movement
Timeline milestones (2013 to 2016):
• Identified Spark’s early potential
• Ships and supports Spark with CDH 4.4
• Spark on YARN integration
• Announces initiative to make Spark the standard execution engine
• Launches first Spark training
• Added security integration
• Cloudera engineers publish O’Reilly Spark book
• Leading effort to further performance, usability, and enterprise-readiness
Slide 5
Community Initiative: Spark Supersedes MapReduce
Community development to port components to Spark:
Stage 1
• Crunch on Spark
• Search on Spark
Stage 2
• Hive on Spark (beta)
• Spark on HBase (beta)
Stage 3
• Pig on Spark (alpha)
• Sqoop on Spark
Slide 6
Cloudera Customer Use Cases
Core Spark
Financial Services:
• Portfolio Risk Analysis
• ETL Pipeline Speed-Up
• 20+ years of stock data
Health:
• Identify disease-causing genes in the full human genome
• Calculate Jaccard scores on health care data sets
ERP:
• Optical Character Recognition and Bill Classification
Data Services:
• Trend analysis
• Document classification (LDA)
• Fraud analytics
Spark Streaming
Financial Services:
• Online Fraud Detection
Health:
• Incident Prediction for Sepsis
Retail:
• Online Recommendation Systems
• Real-Time Inventory Management
Ad Tech:
• Real-Time Ad Performance Analysis
Slide 7
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph
Fast Batch & Stream Processing
• In-Memory processing and caching
Slide 8
The Spark Ecosystem & Hadoop
Hadoop Integration
• Spark-on-YARN integration
• Shares data, metadata, administration, security, & governance
Stack shown in the diagram:
• Spark components: Spark Streaming, MLlib, SparkSQL, GraphX, Dataframes, SparkR
• Engines: Spark, Impala, MR, Others
• RESOURCE MANAGEMENT: YARN
• STORAGE: HDFS, HBase
Slide 9
Logistic Regression Performance (Data Fits in Memory)
[Chart: Running Time (s), axis 0 to 4000, versus # of Iterations (1, 5, 10, 20, 30), for MapReduce and Spark]
• MapReduce: 110 s/iteration
• Spark: first iteration = 80 s; further iterations 1 s due to caching
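The per-iteration figures quoted on this slide are enough to reproduce the shape of both curves. The sketch below is only that arithmetic, assuming each MapReduce iteration costs a flat 110 s and Spark costs 80 s for the first iteration and 1 s for each later one; the function and constant names are illustrative and do not come from the deck.

```python
# Running-time model implied by the slide's annotations.
# Assumption: per-iteration costs are treated as exactly constant.
MAPREDUCE_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_CACHED_ITERATION_SECONDS = 1


def mapreduce_running_time(iterations: int) -> int:
    """Every iteration pays the full cost, so time grows linearly."""
    return MAPREDUCE_SECONDS_PER_ITERATION * iterations


def spark_running_time(iterations: int) -> int:
    """The first iteration loads and caches the data; later ones reuse the cache."""
    if iterations <= 0:
        return 0
    return (SPARK_FIRST_ITERATION_SECONDS
            + SPARK_CACHED_ITERATION_SECONDS * (iterations - 1))


# Iteration counts taken from the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(n, mapreduce_running_time(n), spark_running_time(n))
```

Under these assumptions, 30 iterations come to 3300 s for MapReduce against 109 s for Spark, which is consistent with the chart's 0 to 4000 s axis.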
Slide 10
Apache Spark Streaming
What is it?
• Run continuous processing of data using Spark’s core API
• Extends Spark concepts to fault-tolerant, transformable streams
• Adds “rolling window” operations
• Example: Compute rolling averages or counts for data over last five minutes
Benefits:
• Reuse knowledge and code in both contexts
• Same programming paradigm for streaming and batch
• Simplicity of development
• High-level API with automatic DAG generation
• Excellent throughput
• Scale easily to support large volumes of data ingest
• Combine elements like MLlib and Oryx into streaming applications
Common Use Cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS
• Detect anomalous behavior and trigger alerts
• Continuous reporting of summary metrics for incoming data
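The rolling-window example on this slide (averages or counts over the last five minutes) can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python; it does not use the Spark Streaming API, and the `RollingAverage` class, its method names, and the sample events are all hypothetical. It also assumes events arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Tuple

WINDOW_SECONDS = 5 * 60  # "last five minutes" from the slide


class RollingAverage:
    """Keeps (timestamp, value) pairs and averages those inside the window."""

    def __init__(self, window_seconds: float = WINDOW_SECONDS) -> None:
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[float, float]] = deque()
        self.total = 0.0

    def add(self, timestamp: float, value: float) -> None:
        """Record an event, then drop anything that fell out of the window."""
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # An event is expired once it is at least one full window older than `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def count(self) -> int:
        return len(self.events)

    def average(self) -> float:
        return self.total / len(self.events) if self.events else 0.0


window = RollingAverage()
# Hypothetical (seconds, value) events; the first one expires when t=400 arrives.
for ts, value in [(0, 10.0), (120, 20.0), (299, 30.0), (400, 40.0)]:
    window.add(ts, value)
print(window.count(), window.average())
```

Keeping a running total alongside the deque makes each update cheap: only expired events are touched, rather than re-summing the whole window.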
Slide 11
Spark Streaming Architectures
Components shown in the diagram:
• Data Sources
• Ingest Integration Layer: Flume, Kafka
• Spark Stream Processing: Data Prep, Aggregation / Scoring
• HDFS: Spark Long-Term Analytics / Model Building
• HBase: Real-Time Result Serving
Slide 12
SparkSQL + Dataframes
Machine Learning Applications
• Goal:
  • Spark/Java Developers and Data Scientists can inline SQL into Spark apps
• Designed for:
  • Ease of development for Spark developers
  • Handful of concurrent Spark jobs
• Strengths:
  • Ease of embedding SQL into Java or Scala applications
  • SQL for common functionality in developer flow (e.g. aggregations, filters, samples)
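The idea of inlining SQL into application code for filters and aggregations can be shown in a few lines. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine so that it runs anywhere; it is not Spark SQL, and the `trades` table, its columns, and the `average_price_by_symbol` helper are hypothetical.

```python
import sqlite3

# Stand-in engine: an in-memory SQLite database, NOT Spark SQL.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE trades (symbol TEXT, price REAL, volume INTEGER)"
)
connection.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAA", 10.0, 100), ("AAA", 12.0, 300), ("BBB", 5.0, 50), ("BBB", 7.0, 500)],
)


def average_price_by_symbol(min_volume: int) -> dict:
    """Filter and aggregate in SQL, then hand the result back to the program."""
    rows = connection.execute(
        "SELECT symbol, AVG(price) FROM trades "
        "WHERE volume >= ? GROUP BY symbol ORDER BY symbol",
        (min_volume,),
    ).fetchall()
    return dict(rows)


averages = average_price_by_symbol(min_volume=100)
# Ordinary application code continues from the SQL result.
expensive = [symbol for symbol, avg in averages.items() if avg > 8.0]
print(averages, expensive)
```

The design point is the hand-off: the declarative part (filter, group, average) stays in SQL, while the surrounding program keeps control of parameters and of what happens to the result.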
Slide 13
Execution Pipeline
Stages shown in the diagram:
• Inputs: SQL AST, Dataframes
• Logical Plan
• Optimized Logical Plan
• Physical Plans
• CBO
• Selected Plan
• RDDs
Slide 14
Uniting Spark and Hadoop
The One Platform Initiative
• Management: Leverage Hadoop-native resource management.
• Security: Full support for Hadoop security and beyond.
• Scale: Enable 10k-node clusters.
• Streaming: Support for 80% of common stream processing workloads.
Slide 15
Detailed Roadmap: One Platform Initiative
Legend on the original slide: completed work versus planned future work (the markers are not preserved in this text).
Management
• Spark on YARN Integration
• HBase integration
• Improved metrics for monitoring/troubleshooting
• Dynamic Resource Allocation
• Spark on YARN:
• Container resizing
• Dynamic Resource Allocation for Streaming
• Simplified resource configuration
• Improved WebUI for debugging
• Improved metrics for visibility into resource utilization
• Smart auto-tuning of job parameters
Security
• Kerberos Integration
• HDFS Sync (Sentry)
• Secure data at rest
• Secure data over the wire
• Audit/Lineage (Navigator)
• Spark PCI compliance
• Integration with Intel’s advanced encryption libraries
• Enable column and view level security
Scale
• Revamp Scheduler handling of node failure
• Sort-based shuffle improvements
• Task Scheduling based on HDFS data locality and caching
• Scheduler improvements for performance at scale
• Stress test at scale with mixed multi-tenant workloads
• HDFS DDM Integration
• Dynamic resource utilization & prioritization
• Scale Spark History Server for 1000s of jobs
Streaming
• Zero Data Loss with Spark Streaming Resilience
• Flume integration
• Kafka integration
• SQL semantics for expressing streaming jobs (Business Users)
• New streaming specific API extensions
• Streaming application management (pause, update, redeploy) via CM
• Optimized state updates: efficient point lookups and delta updates
Slide 16
Spark Resources
• Learn Spark
• O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
• Cloudera Developer Blog
• cloudera.com/spark
• Get Trained
• Cloudera Spark Training
• Try it Out
• Cloudera Live Spark Tutorial
Slide 17
Try It With Cloudera Live
cloudera.com/live
Featuring tutorials on:
CDH
Slide 18
Thank You
Jairam Ranganathan
jairam@cloudera.com

Mais conteúdo relacionado

Mais procurados

Asiatic Marketing Communications Limited Internship Report
Asiatic Marketing Communications Limited Internship ReportAsiatic Marketing Communications Limited Internship Report
Asiatic Marketing Communications Limited Internship ReportAhsan Habib
 
wipro consumer care and lighting, SIP presentation
wipro consumer care and lighting, SIP presentationwipro consumer care and lighting, SIP presentation
wipro consumer care and lighting, SIP presentationAbhishek Tiwari
 
WIPRO Fundamental Analysis
WIPRO Fundamental AnalysisWIPRO Fundamental Analysis
WIPRO Fundamental AnalysisDeepak Kumar
 
Impact of Digital Marketing as a Marketing Tool in India
Impact of Digital Marketing as a Marketing Tool in IndiaImpact of Digital Marketing as a Marketing Tool in India
Impact of Digital Marketing as a Marketing Tool in IndiaSandip P.
 
Tata Motors CSR Activity PPT 2015-2016
Tata Motors CSR Activity PPT 2015-2016Tata Motors CSR Activity PPT 2015-2016
Tata Motors CSR Activity PPT 2015-2016Rahul Gulaganji
 
Project report on mahindra and mahindra (1)
Project report on mahindra and mahindra (1)Project report on mahindra and mahindra (1)
Project report on mahindra and mahindra (1)mehrajkhan16
 
Summer Internship Project MBA at Britannia Industry Limited
Summer Internship Project MBA at Britannia Industry LimitedSummer Internship Project MBA at Britannia Industry Limited
Summer Internship Project MBA at Britannia Industry LimitedDalpat Parihar
 
Swot analysis of hul
Swot analysis of hulSwot analysis of hul
Swot analysis of hulomgogna
 
Kubota KH41 Excavator Service Repair Manual
Kubota KH41 Excavator Service Repair ManualKubota KH41 Excavator Service Repair Manual
Kubota KH41 Excavator Service Repair Manualuekdjkm jksemmd
 
Analysis of-consumer-perception-on-dabur-honey
Analysis of-consumer-perception-on-dabur-honeyAnalysis of-consumer-perception-on-dabur-honey
Analysis of-consumer-perception-on-dabur-honeyAbhisheK Kumar Rajoria
 
MINOR PROJECT REPORT ON MARKET POTENTIAL OF RICE POWDER BY JAYABHARATH MODERN...
MINOR PROJECT REPORT ON MARKET POTENTIAL OF RICE POWDER BY JAYABHARATH MODERN...MINOR PROJECT REPORT ON MARKET POTENTIAL OF RICE POWDER BY JAYABHARATH MODERN...
MINOR PROJECT REPORT ON MARKET POTENTIAL OF RICE POWDER BY JAYABHARATH MODERN...Akaresh Jose Kaviyil JY
 
Internship Report on EFU Life Assuarance ltd.
Internship Report on EFU Life Assuarance ltd.Internship Report on EFU Life Assuarance ltd.
Internship Report on EFU Life Assuarance ltd.Wish Mrt'xa
 

Mais procurados (17)

Asiatic Marketing Communications Limited Internship Report
Asiatic Marketing Communications Limited Internship ReportAsiatic Marketing Communications Limited Internship Report
Asiatic Marketing Communications Limited Internship Report
 
Marketing Mix
Marketing MixMarketing Mix
Marketing Mix
 
Icici sec
Icici secIcici sec
Icici sec
 
wipro consumer care and lighting, SIP presentation
wipro consumer care and lighting, SIP presentationwipro consumer care and lighting, SIP presentation
wipro consumer care and lighting, SIP presentation
 
Csr infosys
Csr infosysCsr infosys
Csr infosys
 
WIPRO Fundamental Analysis
WIPRO Fundamental AnalysisWIPRO Fundamental Analysis
WIPRO Fundamental Analysis
 
Impact of Digital Marketing as a Marketing Tool in India
Impact of Digital Marketing as a Marketing Tool in IndiaImpact of Digital Marketing as a Marketing Tool in India
Impact of Digital Marketing as a Marketing Tool in India
 
Tata Motors CSR Activity PPT 2015-2016
Tata Motors CSR Activity PPT 2015-2016Tata Motors CSR Activity PPT 2015-2016
Tata Motors CSR Activity PPT 2015-2016
 
Project report on mahindra and mahindra (1)
Project report on mahindra and mahindra (1)Project report on mahindra and mahindra (1)
Project report on mahindra and mahindra (1)
 
SIP report executive summary
SIP report executive summarySIP report executive summary
SIP report executive summary
 
Summer Internship Project MBA at Britannia Industry Limited
Summer Internship Project MBA at Britannia Industry LimitedSummer Internship Project MBA at Britannia Industry Limited
Summer Internship Project MBA at Britannia Industry Limited
 
Swot analysis of hul
Swot analysis of hulSwot analysis of hul
Swot analysis of hul
 
Kubota KH41 Excavator Service Repair Manual
Kubota KH41 Excavator Service Repair ManualKubota KH41 Excavator Service Repair Manual
Kubota KH41 Excavator Service Repair Manual
 
Analysis of-consumer-perception-on-dabur-honey
Analysis of-consumer-perception-on-dabur-honeyAnalysis of-consumer-perception-on-dabur-honey
Analysis of-consumer-perception-on-dabur-honey
 
Project on ratios
Project on ratiosProject on ratios
Project on ratios
 
MINOR PROJECT REPORT ON MARKET POTENTIAL OF RICE POWDER BY JAYABHARATH MODERN...
MINOR PROJECT REPORT ON MARKET POTENTIAL OF RICE POWDER BY JAYABHARATH MODERN...MINOR PROJECT REPORT ON MARKET POTENTIAL OF RICE POWDER BY JAYABHARATH MODERN...
MINOR PROJECT REPORT ON MARKET POTENTIAL OF RICE POWDER BY JAYABHARATH MODERN...
 
Internship Report on EFU Life Assuarance ltd.
Internship Report on EFU Life Assuarance ltd.Internship Report on EFU Life Assuarance ltd.
Internship Report on EFU Life Assuarance ltd.
 

Destaque

初めてのSpark streaming 〜kafka+sparkstreamingの紹介〜
初めてのSpark streaming 〜kafka+sparkstreamingの紹介〜初めてのSpark streaming 〜kafka+sparkstreamingの紹介〜
初めてのSpark streaming 〜kafka+sparkstreamingの紹介〜Tanaka Yuichi
 
Spark/MapReduceの 機械学習ライブラリ比較検証
Spark/MapReduceの 機械学習ライブラリ比較検証Spark/MapReduceの 機械学習ライブラリ比較検証
Spark/MapReduceの 機械学習ライブラリ比較検証Recruit Technologies
 
Sparkストリーミング検証
Sparkストリーミング検証Sparkストリーミング検証
Sparkストリーミング検証BrainPad Inc.
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)NTT DATA OSS Professional Services
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) hamaken
 
JAWS-DAYS 2015 / 北海道 x 農業 x クラウド
JAWS-DAYS 2015 / 北海道 x 農業 x クラウドJAWS-DAYS 2015 / 北海道 x 農業 x クラウド
JAWS-DAYS 2015 / 北海道 x 農業 x クラウドTakehito Tanabe
 
東急ハンズのクラウドデザインパターン アーキテクチャー編
東急ハンズのクラウドデザインパターン アーキテクチャー編東急ハンズのクラウドデザインパターン アーキテクチャー編
東急ハンズのクラウドデザインパターン アーキテクチャー編一成 田部井
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaLucidworks
 
Neural Networks and Deep Learning
Neural Networks and Deep LearningNeural Networks and Deep Learning
Neural Networks and Deep LearningAsim Jalis
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計Cloudera Japan
 
Cloud Native Hadoop #cwt2016
Cloud Native Hadoop #cwt2016Cloud Native Hadoop #cwt2016
Cloud Native Hadoop #cwt2016Cloudera Japan
 
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~sugiyama koki
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
Sparkコミュニティに飛び込もう!(Spark Meetup Tokyo 2015 講演資料、NTTデータ 猿田 浩輔)
Sparkコミュニティに飛び込もう!(Spark Meetup Tokyo 2015 講演資料、NTTデータ 猿田 浩輔)Sparkコミュニティに飛び込もう!(Spark Meetup Tokyo 2015 講演資料、NTTデータ 猿田 浩輔)
Sparkコミュニティに飛び込もう!(Spark Meetup Tokyo 2015 講演資料、NTTデータ 猿田 浩輔)NTT DATA OSS Professional Services
 
IoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkIoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkTakanori Suzuki
 

Destaque (20)

初めてのSpark streaming 〜kafka+sparkstreamingの紹介〜
初めてのSpark streaming 〜kafka+sparkstreamingの紹介〜初めてのSpark streaming 〜kafka+sparkstreamingの紹介〜
初めてのSpark streaming 〜kafka+sparkstreamingの紹介〜
 
Spark/MapReduceの 機械学習ライブラリ比較検証
Spark/MapReduceの 機械学習ライブラリ比較検証Spark/MapReduceの 機械学習ライブラリ比較検証
Spark/MapReduceの 機械学習ライブラリ比較検証
 
Sparkストリーミング検証
Sparkストリーミング検証Sparkストリーミング検証
Sparkストリーミング検証
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
 
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料) 40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
40分でわかるHadoop徹底入門 (Cloudera World Tokyo 2014 講演資料)
 
Mesos framework API v1
Mesos framework API v1Mesos framework API v1
Mesos framework API v1
 
JAWS-DAYS 2015 / 北海道 x 農業 x クラウド
JAWS-DAYS 2015 / 北海道 x 農業 x クラウドJAWS-DAYS 2015 / 北海道 x 農業 x クラウド
JAWS-DAYS 2015 / 北海道 x 農業 x クラウド
 
東急ハンズのクラウドデザインパターン アーキテクチャー編
東急ハンズのクラウドデザインパターン アーキテクチャー編東急ハンズのクラウドデザインパターン アーキテクチャー編
東急ハンズのクラウドデザインパターン アーキテクチャー編
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
 
Neural Networks and Deep Learning
Neural Networks and Deep LearningNeural Networks and Deep Learning
Neural Networks and Deep Learning
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計
 
Cloud Native Hadoop #cwt2016
Cloud Native Hadoop #cwt2016Cloud Native Hadoop #cwt2016
Cloud Native Hadoop #cwt2016
 
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
Spark Streamingを使ってみた ~Twitterリアルタイムトレンドランキング~
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Sparkコミュニティに飛び込もう!(Spark Meetup Tokyo 2015 講演資料、NTTデータ 猿田 浩輔)
Sparkコミュニティに飛び込もう!(Spark Meetup Tokyo 2015 講演資料、NTTデータ 猿田 浩輔)Sparkコミュニティに飛び込もう!(Spark Meetup Tokyo 2015 講演資料、NTTデータ 猿田 浩輔)
Sparkコミュニティに飛び込もう!(Spark Meetup Tokyo 2015 講演資料、NTTデータ 猿田 浩輔)
 
IoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache FlinkIoT時代におけるストリームデータ処理と急成長の Apache Flink
IoT時代におけるストリームデータ処理と急成長の Apache Flink
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop
 
Hadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash CourseHadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash Course
 

Semelhante a Apache Spark: Usage and Roadmap in Hadoop

Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Spark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark Summit
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topicsValentin Kropov
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiDatabricks
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 

Semelhante a Apache Spark: Usage and Roadmap in Hadoop (20)

Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Spark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan Saldich
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 

Mais de Cloudera Japan

Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)Cloudera Japan
 
機械学習の定番プラットフォームSparkの紹介
機械学習の定番プラットフォームSparkの紹介機械学習の定番プラットフォームSparkの紹介
機械学習の定番プラットフォームSparkの紹介Cloudera Japan
 
HDFS Supportaiblity Improvements
HDFS Supportaiblity ImprovementsHDFS Supportaiblity Improvements
HDFS Supportaiblity ImprovementsCloudera Japan
 
分散DB Apache Kuduのアーキテクチャ DBの性能と一貫性を両立させる仕組み 「HybridTime」とは
分散DB Apache KuduのアーキテクチャDBの性能と一貫性を両立させる仕組み「HybridTime」とは分散DB Apache KuduのアーキテクチャDBの性能と一貫性を両立させる仕組み「HybridTime」とは
分散DB Apache Kuduのアーキテクチャ DBの性能と一貫性を両立させる仕組み 「HybridTime」とはCloudera Japan
 
Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018Cloudera Japan
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Cloudera Japan
 
HBase Across the World #LINE_DM
HBase Across the World #LINE_DMHBase Across the World #LINE_DM
HBase Across the World #LINE_DMCloudera Japan
 
Cloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennightCloudera のサポートエンジニアリング #supennight
Cloudera のサポートエンジニアリング #supennightCloudera Japan
 
Train, predict, serve: How to go into production your machine learning model
Train, predict, serve: How to go into production your machine learning modelTrain, predict, serve: How to go into production your machine learning model
Train, predict, serve: How to go into production your machine learning modelCloudera Japan
 
Apache Kuduを使った分析システムの裏側
Apache Kuduを使った分析システムの裏側Apache Kuduを使った分析システムの裏側
Apache Kuduを使った分析システムの裏側Cloudera Japan
 
Cloudera in the Cloud #CWT2017
Cloudera in the Cloud #CWT2017Cloudera in the Cloud #CWT2017
Cloudera in the Cloud #CWT2017Cloudera Japan
 
先行事例から学ぶ IoT / ビッグデータの始め方
先行事例から学ぶ IoT / ビッグデータの始め方先行事例から学ぶ IoT / ビッグデータの始め方
先行事例から学ぶ IoT / ビッグデータの始め方Cloudera Japan
 
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017Cloudera Japan
 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017Cloudera Japan
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechCloudera Japan
 
Hue 4.0 / Hue Meetup Tokyo #huejp
Hue 4.0 / Hue Meetup Tokyo #huejpHue 4.0 / Hue Meetup Tokyo #huejp
Hue 4.0 / Hue Meetup Tokyo #huejpCloudera Japan
 
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017Cloudera Japan
 
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadeda
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadedaCloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadeda
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadedaCloudera Japan
 
Cloudera + MicrosoftでHadoopするのがイイらしい。 #CWT2016
Cloudera + MicrosoftでHadoopするのがイイらしい。 #CWT2016Cloudera + MicrosoftでHadoopするのがイイらしい。 #CWT2016
Cloudera + MicrosoftでHadoopするのがイイらしい。 #CWT2016Cloudera Japan
 
大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016Cloudera Japan
 

Mais de Cloudera Japan (20)

Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
 
機械学習の定番プラットフォームSparkの紹介
Apache Spark: Usage and Roadmap in Hadoop

  • 1. Apache Spark: Usage and Roadmap in Hadoop. Jai Ranganathan. © Cloudera, Inc. All rights reserved.
  • 2. Spark will replace MapReduce to become the standard execution engine for Hadoop.
  • 3. The Future of Data Processing on Hadoop: Spark complemented by specialized fit-for-purpose engines. General data processing with Spark: fast batch processing, machine learning, and stream processing. Analytic database with Impala: low-latency, massively concurrent queries. Full-text search with Solr: querying textual data. On-disk processing with MapReduce: jobs at extreme scale that are extremely disk-IO intensive. Shared: data storage, metadata, resource management, administration, security, governance.
  • 4. Cloudera Leading the Spark Movement. Timeline, 2013 to 2016: identified Spark’s early potential; ships and supports Spark with CDH 4.4; Spark on YARN integration; announces initiative to make Spark the standard execution engine; launches first Spark training; added security integration; Cloudera engineers publish O’Reilly Spark book; leading effort to further performance, usability, and enterprise-readiness.
  • 5. Community Initiative: Spark Supersedes MapReduce. Community development to port components to Spark. Stage 1: Crunch on Spark; Search on Spark. Stage 2: Hive on Spark (beta); Spark on HBase (beta). Stage 3: Pig on Spark (alpha); Sqoop on Spark.
  • 6. Cloudera Customer Use Cases. Core Spark. Financial Services: portfolio risk analysis; ETL pipeline speed-up; 20+ years of stock data. Health: identify disease-causing genes in the full human genome; calculate Jaccard scores on health care data sets. ERP: optical character recognition and bill classification. Data Services: trend analysis; document classification (LDA); fraud analytics. Spark Streaming. Financial Services: online fraud detection. Health: incident prediction for sepsis. Retail: online recommendation systems; real-time inventory management. Ad Tech: real-time ad performance analysis.
  • 7. Apache Spark: flexible, in-memory data processing for Hadoop. Easy development: rich APIs for Scala, Java, and Python; interactive shell. Flexible, extensible API: APIs for different types of workloads (batch, streaming, machine learning, graph). Fast batch and stream processing: in-memory processing and caching.
  • 8. The Spark Ecosystem & Hadoop. Hadoop integration: Spark-on-YARN integration; shares data, metadata, administration, security, and governance. Stack diagram: storage (HDFS, HBase); resource management (YARN); engines (Spark, Impala, MR, others); Spark components (Spark Streaming, MLlib, SparkSQL, GraphX, Dataframes, SparkR).
  • 9. Logistic Regression Performance (Data Fits in Memory). Chart: running time (s), axis 0 to 4000, against number of iterations (1, 5, 10, 20, 30), MapReduce vs. Spark. MapReduce: 110 s/iteration. Spark: first iteration = 80 s; further iterations 1 s due to caching.
  • 10. Apache Spark Streaming. What is it? Runs continuous processing of data using Spark’s core API; extends Spark concepts to fault-tolerant, transformable streams; adds “rolling window” operations (example: compute rolling averages or counts for data over the last five minutes). Benefits: reuse knowledge and code in both contexts (same programming paradigm for streaming and batch); simplicity of development (high-level API with automatic DAG generation); excellent throughput (scales easily to support large volumes of data ingest); combine elements like MLlib and Oryx into streaming applications. Common use cases: “on-the-fly” ETL as data is ingested into Hadoop/HDFS; detecting anomalous behavior and triggering alerts; continuous reporting of summary metrics for incoming data.
  • 11. Spark Streaming Architectures. Diagram components: data sources; ingest through an integration layer (Flume, Kafka); Spark stream processing (data prep, aggregation/scoring); HDFS with Spark for long-term analytics and model building; HBase for real-time result serving.
  • 12. SparkSQL + Dataframes: Machine Learning Applications. Goal: Spark/Java developers and data scientists can inline SQL into Spark apps. Designed for: ease of development for Spark developers; a handful of concurrent Spark jobs. Strengths: ease of embedding SQL into Java or Scala applications; SQL for common functionality in the developer flow (e.g., aggregations, filters, samples).
  • 13. Execution Pipeline. Diagram labels: SQL AST, Dataframes, Logical Plan, Optimized Logical Plan, Physical Plans, CBO, Selected Plan, RDDs.
  • 14. Uniting Spark and Hadoop: The One Platform Initiative. Management: leverage Hadoop-native resource management. Security: full support for Hadoop security and beyond. Scale: enable 10k-node clusters. Streaming: support for 80% of common stream processing workloads.
  • 15. Detailed Roadmap: One Platform Initiative. Management: Spark on YARN integration; HBase integration; improved metrics for monitoring/troubleshooting; Dynamic Resource Allocation; Spark on YARN: container resizing; Dynamic Resource Allocation for Streaming; simplified resource configuration; improved WebUI for debugging; improved metrics for visibility into resource utilization; smart auto-tuning of job parameters. Security: Kerberos integration; HDFS Sync (Sentry); secure data at rest; secure data over the wire; Audit/Lineage (Navigator); Spark PCI compliance; integration with Intel’s advanced encryption libraries; enable column- and view-level security. Scale: revamp Scheduler handling of node failure; sort-based shuffle improvements; task scheduling based on HDFS data locality and caching; Scheduler improvements for performance at scale; stress test at scale with mixed multi-tenant workloads; HDFS DDM integration; dynamic resource utilization & prioritization; scale Spark History Server for 1000s of jobs. Streaming: zero data loss with Spark Streaming resilience; Flume integration; Kafka integration; SQL semantics for expressing streaming jobs (business users); new streaming-specific API extensions; streaming application management (pause, update, redeploy) via CM; optimized state updates: efficient point lookups and delta updates. Legend: completed work, planned future work (color key not preserved).
  • 16. Spark Resources. Learn Spark: O’Reilly Advanced Analytics with Spark eBook (written by Clouderans); Cloudera Developer Blog; cloudera.com/spark. Get trained: Cloudera Spark Training. Try it out: Cloudera Live Spark Tutorial.
  • 17. Try It With Cloudera Live: cloudera.com/live. Featuring tutorials on: CDH
  • 18. Thank You. Jairam Ranganathan, jairam@cloudera.com
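
The figures quoted on slide 9 imply a simple cost model, sketched here in plain Python. It assumes every MapReduce iteration costs the quoted 110 s, and that Spark pays 80 s once and then about 1 s for each later iteration. The function and constant names are illustrative, and the totals are derived from the slide's numbers rather than measured.

```python
# Cost model implied by slide 9. The three constants are the figures quoted
# on the slide; treating them as exact and constant is an assumption.
MAPREDUCE_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_CACHED_ITERATION_SECONDS = 1


def mapreduce_total(iterations: int) -> int:
    """Total running time when every iteration costs the same amount."""
    return MAPREDUCE_SECONDS_PER_ITERATION * iterations


def spark_total(iterations: int) -> int:
    """Total running time when only the first iteration pays the load cost."""
    if iterations <= 0:
        return 0
    return (SPARK_FIRST_ITERATION_SECONDS
            + SPARK_CACHED_ITERATION_SECONDS * (iterations - 1))


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 30):  # iteration counts on the slide's x-axis
        print(f"{n:>2} iterations: MapReduce {mapreduce_total(n)} s, "
              f"Spark {spark_total(n)} s")
```

Under these assumptions the gap widens with every iteration, because the MapReduce total grows by 110 s per iteration while the Spark total grows by about 1 s once the data is cached.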
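
The “rolling window” operation mentioned on slide 10 can be illustrated without a cluster. The sketch below is plain Python, not the Spark Streaming API: the `RollingWindow` class, its method names, and the sample readings are all hypothetical. It only shows the idea of keeping the last five minutes of events and computing a count and an average over them, assuming events arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RollingWindow:
    """Keeps (timestamp, value) pairs from the last `width_seconds` seconds."""

    def __init__(self, width_seconds: float = 300.0) -> None:  # 300 s = 5 min
        self.width_seconds = width_seconds
        self._events: Deque[Tuple[float, float]] = deque()
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> None:
        """Record an event; timestamps are assumed to be non-decreasing."""
        self._events.append((timestamp, value))
        self._total += value
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window relative to `now`.
        while self._events and self._events[0][0] <= now - self.width_seconds:
            _, old_value = self._events.popleft()
            self._total -= old_value

    def count(self) -> int:
        """Number of events currently inside the window."""
        return len(self._events)

    def average(self) -> Optional[float]:
        """Mean of the values in the window, or None if it is empty."""
        return self._total / len(self._events) if self._events else None


if __name__ == "__main__":
    window = RollingWindow()
    # Hypothetical readings: (seconds since start, value).
    for ts, value in [(0.0, 10.0), (120.0, 20.0), (400.0, 30.0)]:
        window.add(ts, value)
    # The reading at t=0 is more than five minutes older than t=400.
    print(window.count(), window.average())
```

A streaming engine applies the same idea to partitioned, fault-tolerant streams; this single-process version leaves out distribution and recovery entirely.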