1
Hatayama Hideharu (Hide)
Conference Memo
2
about the conference
- Held annually by Hadoop User Group Japan since 2009
(so this was the 5th)
- More than 1200 attendees
- Sponsored by Recruit Technologies
and Cloudera, SAS Institute Japan, Treasure Data,
IBM Japan, MapR Technologies
3
Keynote
4
Future of Data by Doug Cutting
Chief Architect of Cloudera
Creator of Lucene, Nutch, and Hadoop
“Hardware costs will keep decreasing, and
the value of data will keep increasing, as before.”
“Hadoop’s functionality will keep growing,
so it might become possible to execute
transactional tasks on Hadoop.”
Q: I’d be glad if the Apache version became the standard ...
A: We’ll keep contributing the good parts back to the Apache version.
Q: I have high expectations for near-real-time processing on Hadoop ...
A: We’re working on that, and if it matures we won’t need to use Storm or ...
Q: Are you still working on Lucene?
A: Sorry, I’m not. I’m not sure about the current state of Lucene...
5
Future of Spark by Patrick Wendell
Main developer of Spark, working at
Databricks.
Databricks Cloud (PaaS Spark cluster)
6
Session
7
BigQuery and the world after MapReduce
Map & Reduce -> BigQuery (Dremel)
GFS -> Colossus
Storage: $0.026 / GB / month
Queries: $5 / TB
https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
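As a rough worked example of the pricing above, the sketch below estimates a monthly bill. The two prices come from the slide; the function name and the sample workload are assumptions made up for illustration.

```python
STORAGE_USD_PER_GB_MONTH = 0.026  # from the slide
QUERY_USD_PER_TB = 5.0            # from the slide


def estimate_monthly_cost(stored_gb: float, scanned_tb: float) -> float:
    """Return an estimated monthly cost in USD for storage plus queries."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    queries = scanned_tb * QUERY_USD_PER_TB
    return storage + queries


if __name__ == "__main__":
    # Hypothetical workload: 1,000 GB stored, 20 TB scanned per month.
    print(f"${estimate_monthly_cost(1000, 20):.2f}")
```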
8
BigQuery and the world after MapReduce
Small JOIN
executed with Broadcast JOIN
One table should be < 8MB
One table is sent to every shard
Big JOIN
executed with shuffling
Both tables can be > 8MB
Shuffler doesn’t sort,
just hash partitioning
https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
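The two join strategies above can be sketched on in-memory lists. This is only a toy model of the idea, not how BigQuery is implemented: tables are lists of (key, value) pairs, a "shard" is a plain list, and all function names are hypothetical.

```python
from collections import defaultdict


def broadcast_join(small, big_shards):
    """Small JOIN: send the whole small table to every shard of the big table."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    out = []
    for shard in big_shards:          # each shard joins locally against its copy
        for key, value in shard:
            for small_value in lookup.get(key, []):
                out.append((key, value, small_value))
    return out


def hash_partition(rows, num_shards):
    """Shuffle step: route rows by hash of the key, without sorting."""
    shards = [[] for _ in range(num_shards)]
    for key, value in rows:
        shards[hash(key) % num_shards].append((key, value))
    return shards


def shuffle_join(left, right, num_shards=4):
    """Big JOIN: hash-partition both tables so matching keys meet in one shard."""
    out = []
    for l_shard, r_shard in zip(hash_partition(left, num_shards),
                                hash_partition(right, num_shards)):
        # Within one shard, a local hash join is enough.
        out.extend(broadcast_join(r_shard, [l_shard]))
    return out
```

Both functions return the same rows for the same inputs; the difference is which data has to move. The broadcast variant copies one (small) table everywhere, while the shuffle variant moves both tables once.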
9
BigQuery and the world after MapReduce
BigQuery + Hadoop, BigQuery Streaming, UDF with JavaScript, etc...
https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
10
Batch processing and Stream processing
by SQL
Batch processing
- Hadoop/Hive
- hourly - weekly...
- Highest throughput
- Largest latency
Short Batch processing
- Presto, Impala, Drill
- secondly - hourly...
- Normal throughput
- Small latency
Stream processing
- Storm, Kafka, Esper,
Norikra, Fluentd,...
- secondly - hourly...
- Normal throughput
- Smallest latency
- After query registration,
it runs repeatedly
Norikra
- Internally using Esper
but NO SCHEMA
- Not distributed
- 10K events / sec
on 2cpu (8core)
http://www.slideshare.net/tagomoris/hcj2014-sql
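The "register once, run repeatedly" behavior of stream queries can be sketched as follows. This is not Norikra's or Esper's API; the class, the count-based window, and the sample query are assumptions chosen to keep the example small and deterministic.

```python
from collections import Counter


class ContinuousQuery:
    """A registered query re-evaluated over each fixed-size window of events."""

    def __init__(self, window_size, query):
        self.window_size = window_size
        self.query = query          # function: list of events -> result
        self.buffer = []
        self.results = []

    def send(self, event):
        self.buffer.append(event)
        if len(self.buffer) == self.window_size:
            self.results.append(self.query(self.buffer))
            self.buffer = []        # start the next window


def count_by_path(events):
    # Events are schemaless dicts; a missing field is grouped under None.
    return dict(Counter(e.get("path") for e in events))
```

A caller would create `ContinuousQuery(3, count_by_path)` once and then call `send` for every incoming event; a new result is appended each time a window fills.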
11
Deeper Understanding of Spark’s Internals
Spark Execution Model
1. Create DAG of RDDs to represent computation
2. Create logical execution plan for DAG
3. Schedule and execute individual tasks
RDD: Resilient Distributed Dataset
DAG: Directed acyclic graph
task: data + computation
execute all tasks within a stage before moving on to the next stage
http://www.slideshare.net/hadoopconf/japanese-spark-internalssummit20143
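The execution model above can be illustrated with a heavily simplified toy scheduler. It assumes a linear chain of operations instead of a general DAG, places each shuffle operation at the start of the following stage, and uses hypothetical function names; real Spark scheduling is considerably more involved.

```python
def split_into_stages(ops):
    """Group a linear chain of (name, needs_shuffle) ops into stages.

    A new stage starts at every op that needs a shuffle.
    """
    stages = [[]]
    for name, needs_shuffle in ops:
        if needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


def run(stages, num_partitions):
    """Run every task of a stage before moving on to the next stage."""
    log = []
    for stage_id, stage in enumerate(stages):
        for partition in range(num_partitions):   # one task per partition
            log.append((stage_id, partition, "+".join(stage)))
    return log
```

Operations that need no shuffle are pipelined into a single task per partition, which is why a task here is "data + computation": one partition plus the fused chain of functions applied to it.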
12
Deeper Understanding of Spark’s Internals
Common issue checklist
- Ensure enough partitions for concurrency
(at least 2x the number of cores in the cluster, with each task taking at least 100 ms;
commonly between 100 and 10,000 partitions)
- Minimize memory consumption (sorting, large keys in group-by)
- Minimize the amount of data shuffled
- Know the standard library
Memory Problems
Symptoms:
- Inexplicably bad performance
- Inexplicable executor/machine failures
Diagnosis:
- Set spark.executor.extraJavaOptions to include
-XX:+PrintGCDetails
-XX:+HeapDumpOnOutOfMemoryError
Resolution:
- Increase spark.executor.memory
- Increase number of partitions
- Re-evaluate program structure
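As a small sketch of how the diagnosis and resolution settings might be passed to a job, the helper below builds a `spark-submit` argument list. It assumes a Spark version whose `spark-submit` accepts `--conf key=value`; the helper name, the memory size, and the application file name are hypothetical.

```python
def build_spark_submit_args(executor_memory, app):
    """Build a spark-submit argument list with GC logging and heap dumps on."""
    java_opts = "-XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError"
    return [
        "spark-submit",
        "--conf", f"spark.executor.extraJavaOptions={java_opts}",
        "--conf", f"spark.executor.memory={executor_memory}",
        app,
    ]


if __name__ == "__main__":
    # Hypothetical values: 4 GB executors, an application called my_job.py.
    print(" ".join(build_spark_submit_args("4g", "my_job.py")))
```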
13
Spark on large Hadoop cluster
and evaluation from the view point of
enterprise Hadoop user and developer
http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
14
Spark on large Hadoop cluster
and evaluation from the view point of
enterprise Hadoop user and developer
http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
15
Spark on large Hadoop cluster
and evaluation from the view point of
enterprise Hadoop user and developer
Please check the evaluation result & summary in the slide
http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
4k cores
10TB+ RAM
10G network
Spark 1.0.0
HDFS (CDH 5.0)
1. wordcount
linear; reduce is not a bottleneck
2. SparkHdfsLR (LogisticRegression)
cached (data is small): very fast for the 2nd, 3rd, ... passes
non-cached (data is big): same as Hadoop
3. GroupByTest (large shuffle process)
also linear
4. POC of a certain project
key is memory management!!!
to utilize the cache:
rich data format, simple tasks -> simple data format, complicated tasks
16
Treasure Data on the YARN
YARN (Yet Another Resource Negotiator)
job tracker ->
resource manager
application master
job history server
task tracker ->
node manager
http://www.slideshare.net/ryukobayashi/treasure-data-on-the-yarn-hadoop-conference-japan-2014
17
Treasure Data on the YARN
Many configuration changes are required for MRv1 -> YARN
-> copy HDFS directory of CDH VM or HDP VM
-> use the Ambari or Cloudera Manager
-> use hdp-configuration-utils.py script (http://goo.gl/L2hxyq)
Don't use Apache Hadoop 2.2.0, 2.3.0, or HDP 2.0 (2.2.0 based):
there is a scheduler bug (deadlock)
These are OK:
Apache Hadoop 2.4.1
CDH 5.0.2 (2.3.0 based, patched)
HDP 2.1 (2.4.0 based)
18
Practical Machine Learning - Innovations in
Recommendation Using Mahout and Solr
h: a user's behavior history
Ah: users who did h
user oriented recommendation
(*) cannot be pre-processed
-> not efficient & slow
item oriented recommendation
(*) can be pre-processed by
night time offline batch
-> new user’s h processing can
be done in near real time
http://www.slideshare.net/MapR_Japan/mahoutsolr-hadoop-conference-japan-2014
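The difference between the two approaches can be sketched with a tiny matrix example. Here A is assumed to be a users x items 0/1 history matrix and h a new user's history vector over items; the data and helper names are made up for illustration, and real systems would also filter and weight the cooccurrence counts.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


# rows = users, columns = items (hypothetical toy data)
A = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
h = [1, 0, 0, 0]   # a new user who has only touched item 0

# User-oriented: A h finds users who did h, then A^T maps back to items.
# Both products depend on h, so nothing can be prepared in advance.
user_oriented = matvec(transpose(A), matvec(A, h))

# Item-oriented: A^T A is computed offline; only the last product is per-user.
cooccurrence = matmul(transpose(A), A)
item_oriented = matvec(cooccurrence, h)
```

Both orderings give the same item scores, since A^T (A h) = (A^T A) h; the item-oriented form simply moves the expensive part into the nightly batch.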
19
Practical Machine Learning - Innovations in
Recommendation Using Mahout and Solr
Multi Modal Recommendation (or cross recommendation)
e.g.
movie recommendation
A: search query
B: watched video
-> AtA: query recommendation
-> BtB: video recommendation
-> BtA: video by query recommendation
query: “Paco de Lucia” (Spanish guitarist)
normal result: “hombres de paco” (Spanish TV)
BtA: Spanish classical guitar
Flamenco guitar riff by Van Halen
Dithering
-> reorder recommendation results with random noise
Reference
Practical Machine Learning ebook
http://www.mapr.com/practical-machine-learning
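Dithering can be sketched as re-sorting a ranked list by a noisy score. The scoring rule used here, log(rank) plus Gaussian noise with standard deviation log(epsilon), is an assumption for illustration, as are the function name and the default value of epsilon.

```python
import math
import random


def dither(ranked_items, epsilon=2.0, seed=None):
    """Re-rank items by log(rank) plus Gaussian noise.

    epsilon should be >= 1; epsilon == 1 means no noise at all.
    """
    rng = random.Random(seed)
    sigma = math.log(epsilon)
    scored = [
        (math.log(rank) + rng.gauss(0.0, sigma), item)
        for rank, item in enumerate(ranked_items, start=1)
    ]
    return [item for _, item in sorted(scored, key=lambda pair: pair[0])]
```

Because the base score is logarithmic in rank, top items tend to stay near the top while items further down are shuffled more freely, so repeat visitors see some variety.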
20
my impression
21
my impression
Last year: Hybrid or Mixed architecture
Batch Indexing and Near Real Time, keeping things fast
http://www.slideshare.net/lucenerevolution/batch-indexing-near-real-time-keeping-things-fast
22
my impression
This year: Lambda architecture
Lambda Architecture
http://lambda-architecture.net/
23
my impression
Stream processing can’t be replayed or recovered
-> Hybrid processing for fault tolerance
Stream processing: executes queries in normal operation
Batch processing: executes recovery queries
Batch processing has high latency;
stream processing might not be accurate
-> Hybrid processing for latency reduction & accuracy
Stream processing: prompt reports
Batch processing: fixed reports
To keep good QCD (quality, cost, delivery),
the same query is used for both stream & batch processing
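The hybrid idea can be sketched as one query definition feeding two paths. This is a toy model under stated assumptions: events are (key, timestamp) pairs, the query is a simple count, and the class and method names are hypothetical rather than taken from any framework.

```python
from collections import Counter


def count_by_key(events):
    """The single query definition shared by the stream and batch paths."""
    return Counter(key for key, _timestamp in events)


class HybridReport:
    def __init__(self):
        self.fixed = Counter()      # batch output: accurate, high latency
        self.prompt = Counter()     # stream output: fast, possibly inaccurate

    def on_stream(self, events):
        """Stream path: update the prompt report as events arrive."""
        self.prompt += count_by_key(events)

    def on_batch(self, all_events):
        """Batch path: recompute from stored events and replace the estimate."""
        self.fixed = count_by_key(all_events)
        self.prompt = Counter()     # stream counts are now covered by the batch

    def report(self):
        return self.fixed + self.prompt
```

If the stream path drops an event, the report is temporarily wrong, and the next batch run over the stored events corrects it without a second query definition to maintain.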

More Related Content

What's hot

Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014StampedeCon
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySparkSpark Summit
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLinaro
 
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2DataWorks Summit
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystifiedOmid Vahdaty
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Hadoop mapreduce performance study on arm cluster
Hadoop mapreduce performance study on arm clusterHadoop mapreduce performance study on arm cluster
Hadoop mapreduce performance study on arm clusterairbots
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkJen Aman
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scalesparktc
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
 

What's hot (20)

Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014GPUs in Big Data - StampedeCon 2014
GPUs in Big Data - StampedeCon 2014
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
LCA13: Hadoop DFS Performance
LCA13: Hadoop DFS PerformanceLCA13: Hadoop DFS Performance
LCA13: Hadoop DFS Performance
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2Yarns about YARN: Migrating to MapReduce v2
Yarns about YARN: Migrating to MapReduce v2
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Hadoop mapreduce performance study on arm cluster
Hadoop mapreduce performance study on arm clusterHadoop mapreduce performance study on arm cluster
Hadoop mapreduce performance study on arm cluster
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 

Viewers also liked

Big Data Timeline
Big Data TimelineBig Data Timeline
Big Data TimelineDeZyre
 
A Big Data Timeline
A Big Data TimelineA Big Data Timeline
A Big Data TimelineBig Cloud
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
SQL in the Hybrid World
SQL in the Hybrid WorldSQL in the Hybrid World
SQL in the Hybrid WorldTanel Poder
 
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Cloudera, Inc.
 
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopSomeshwar Kale
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Emilio Coppa
 
SQL Monitoring in Oracle Database 12c
SQL Monitoring in Oracle Database 12cSQL Monitoring in Oracle Database 12c
SQL Monitoring in Oracle Database 12cTanel Poder
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Edureka!
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 

Viewers also liked (16)

Hive sq lfor-hadoop
Hive sq lfor-hadoopHive sq lfor-hadoop
Hive sq lfor-hadoop
 
SQL in Hadoop
SQL in HadoopSQL in Hadoop
SQL in Hadoop
 
Apache hive
Apache hiveApache hive
Apache hive
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Big Data Timeline
Big Data TimelineBig Data Timeline
Big Data Timeline
 
A Big Data Timeline
A Big Data TimelineA Big Data Timeline
A Big Data Timeline
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
SQL in the Hybrid World
SQL in the Hybrid WorldSQL in the Hybrid World
SQL in the Hybrid World
 
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
 
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
 
Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)Hadoop Internals (2.3.0 or later)
Hadoop Internals (2.3.0 or later)
 
SQL Monitoring in Oracle Database 12c
SQL Monitoring in Oracle Database 12cSQL Monitoring in Oracle Database 12c
SQL Monitoring in Oracle Database 12c
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 

Similar to 20140708hcj

Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsGuy Harrison
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 
Webinar: Ways to Succeed with Hadoop in 2015
Webinar: Ways to Succeed with Hadoop in 2015Webinar: Ways to Succeed with Hadoop in 2015
Webinar: Ways to Succeed with Hadoop in 2015Edureka!
 

Similar to 20140708hcj (20)

Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
Devops Spark Streaming
Devops Spark StreamingDevops Spark Streaming
Devops Spark Streaming
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Module01
 Module01 Module01
Module01
 
Spark 101
Spark 101Spark 101
Spark 101
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 
Webinar: Ways to Succeed with Hadoop in 2015
Webinar: Ways to Succeed with Hadoop in 2015Webinar: Ways to Succeed with Hadoop in 2015
Webinar: Ways to Succeed with Hadoop in 2015
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
20140708hcj

  • 2. About the conference
    - Held annually by Hadoop User Group Japan since 2009 (so this is the 5th)
    - More than 1,200 attendees
    - Sponsored by Recruit Technologies, Cloudera, SAS Institute Japan, Treasure Data, IBM Japan, and MapR Technologies
  • 4. Future of Data, by Doug Cutting (Chief Architect of Cloudera; creator of Lucene, Nutch, and Hadoop)
    “Hardware costs will keep decreasing, and the value of data will keep increasing, as before.”
    “Hadoop’s functionality will keep growing, so it might become possible to run transactional tasks on Hadoop.”
    Q: I’d be glad if the Apache version became the standard...
    A: We’ll keep feeding the good parts back into the Apache version.
    Q: I have big expectations for near-real-time Hadoop...
    A: We’re working on that, and if it matures we won’t need to use Storm or ...
    Q: Are you still working on Lucene?
    A: Sorry, I’m not. I’m not sure about the current state of Lucene...
  • 5. Future of Spark, by Patrick Wendell
    Main developer of Spark, working at Databricks.
    Databricks Cloud (PaaS Spark cluster)
  • 7. BigQuery and the world after MapReduce
    Map & Reduce -> BigQuery (Dremel)
    GFS -> Colossus
    Storage: $0.026 / GB / month
    Queries: $5 / TB
    https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
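To make the two prices above concrete, here is a minimal sketch of the arithmetic. The function name and the example workload are hypothetical, and the rates are only the ones quoted on the slide, so treat the result as an illustration rather than a current price quote.

```python
# Rates as quoted on the slide (2014); not necessarily current pricing.
STORAGE_USD_PER_GB_MONTH = 0.026
QUERY_USD_PER_TB = 5.0


def monthly_cost(stored_gb, scanned_tb):
    """Estimate a monthly bill from stored volume and total data scanned by queries."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    queries = scanned_tb * QUERY_USD_PER_TB
    return storage + queries


# Hypothetical workload: 1,000 GB stored, 10 TB scanned per month.
print(round(monthly_cost(1000, 10), 2))
```

With these assumed figures, storage contributes about $26 and queries $50, so the scanned volume dominates the bill.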
  • 8. BigQuery and the world after MapReduce
    Small JOIN
    - executed as a broadcast JOIN
    - one table should be < 8 MB
    - that table is sent to every shard
    Big JOIN
    - executed with shuffling
    - both tables can be > 8 MB
    - the shuffler doesn’t sort, it just hash-partitions
    https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
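The difference between the two JOIN strategies above can be sketched in plain Python. This is only an illustration of the general technique, not BigQuery's implementation; the function names, the list-of-tuples table format, and the shard count are all assumptions.

```python
from collections import defaultdict


def broadcast_join(big_shards, small_table):
    """Small JOIN: the whole small table is made available to every shard of the big one."""
    lookup = defaultdict(list)
    for key, value in small_table:
        lookup[key].append(value)
    joined = []
    for shard in big_shards:  # each shard joins locally against its own copy
        for key, value in shard:
            for other in lookup.get(key, ()):
                joined.append((key, value, other))
    return joined


def shuffle_join(left, right, num_shards):
    """Big JOIN: both tables are hash-partitioned by key (no sorting), then joined per shard."""
    left_parts = [[] for _ in range(num_shards)]
    right_parts = [[] for _ in range(num_shards)]
    for key, value in left:
        left_parts[hash(key) % num_shards].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_shards].append((key, value))
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Rows with equal keys always land in the same shard, so a local join suffices.
        joined.extend(broadcast_join([left_part], right_part))
    return joined


left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]
print(sorted(broadcast_join([left[:2], left[2:]], right)))
print(sorted(shuffle_join(left, right, num_shards=4)))
```

Both calls yield the same rows; the strategies differ only in how much data has to move between shards.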
  • 9. BigQuery and the world after MapReduce
    BigQuery + Hadoop, BigQuery Streaming, UDFs in JavaScript, etc.
    https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
  • 10. Batch processing and Stream processing by SQL
    Batch processing
    - Hadoop/Hive
    - hourly to weekly...
    - Highest throughput
    - Largest latency
    Short Batch processing
    - Presto, Impala, Drill
    - every second to hourly...
    - Normal throughput
    - Small latency
    Stream processing
    - Storm, Kafka, Esper, Norikra, Fluentd, ...
    - every second to hourly...
    - Normal throughput
    - Smallest latency
    - Once a query is registered, it runs repeatedly
    Norikra
    - Uses Esper internally, but with NO SCHEMA
    - Not distributed
    - 10K events / sec on 2 CPUs (8 cores)
    http://www.slideshare.net/tagomoris/hcj2014-sql
  • 11. Deeper Understanding of Spark’s Internals
    Spark Execution Model
    1. Create a DAG of RDDs to represent the computation
    2. Create a logical execution plan for the DAG
    3. Schedule and execute individual tasks
    RDD: Resilient Distributed Dataset
    DAG: Directed acyclic graph
    task: data + computation
    All tasks within a stage are executed before moving on to the next stage.
    http://www.slideshare.net/hadoopconf/japanese-spark-internalssummit20143
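The rule that every task in a stage finishes before the next stage starts can be sketched without Spark at all. The following is a toy model, not Spark's scheduler: the stage names, the `run_stages` helper, and the representation of a task as a (partition, function) pair are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_stages(stage_parents, stage_tasks):
    """Run stages in dependency order, finishing every task of a stage before the next.

    stage_parents maps a stage to the set of stages it depends on.
    stage_tasks maps a stage to a list of (partition, compute) pairs: data + computation.
    """
    executed = []
    for stage in TopologicalSorter(stage_parents).static_order():
        for partition, compute in stage_tasks[stage]:
            executed.append((stage, compute(partition)))
    return executed


# Hypothetical two-stage job: stage1 cannot start until stage0 has finished.
stage_parents = {"stage0": set(), "stage1": {"stage0"}}
stage_tasks = {
    "stage0": [([1, 2], sum), ([3, 4], sum)],
    "stage1": [([3, 7], max)],
}
print(run_stages(stage_parents, stage_tasks))
```

In a real cluster the tasks inside one stage run in parallel across executors; the sketch runs them sequentially and keeps only the ordering between stages.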
  • 12. Deeper Understanding of Spark’s Internals
    Common issue checklist
    - Ensure enough partitions for concurrency (at least 2x the number of cores in the cluster, with each task taking at least 100 ms; commonly between 100 and 10,000 partitions)
    - Minimize memory consumption (sorting, large keys in group-by)
    - Minimize the amount of data shuffled
    - Know the standard library
    Memory Problems
    Symptoms:
    - Inexplicably bad performance
    - Inexplicable executor/machine failures
    Diagnosis:
    - Set spark.executor.extraJavaOptions to -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError
    Resolution:
    - Increase spark.executor.memory
    - Increase the number of partitions
    - Re-evaluate the program structure
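The partition rule of thumb in the checklist above can be turned into a small helper. This is a rough sketch: the function name and parameters are hypothetical, and it assumes task duration scales linearly with partition size, which real jobs only approximate.

```python
def suggest_partitions(total_cores, current_partitions, avg_task_ms):
    """Apply the checklist: at least 2x cores, and tasks of at least about 100 ms each."""
    minimum = 2 * total_cores
    if current_partitions < minimum:
        # Too few partitions to keep every core busy.
        return minimum
    if avg_task_ms < 100:
        # Tasks are too short: use fewer, larger partitions, but never below the 2x floor.
        # Assumption: halving the partition count roughly doubles the task duration.
        scaled = int(current_partitions * avg_task_ms / 100)
        return max(minimum, scaled)
    return current_partitions


print(suggest_partitions(total_cores=16, current_partitions=8, avg_task_ms=500))
print(suggest_partitions(total_cores=16, current_partitions=1000, avg_task_ms=50))
```

The suggestion is only a starting point; measuring task times and GC behavior on the actual job remains the real check.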
  • 13. Spark on a large Hadoop cluster, and an evaluation from the viewpoint of an enterprise Hadoop user and developer
    http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
  • 14. Spark on a large Hadoop cluster, and an evaluation from the viewpoint of an enterprise Hadoop user and developer
    http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
  • 15. Spark on a large Hadoop cluster, and an evaluation from the viewpoint of an enterprise Hadoop user and developer
    Please check the evaluation results and summary in the slides:
    http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
    Environment: 4k cores, 10 TB+ RAM, 10G network, Spark 1.0.0, HDFS (CDH 5.0)
    1. wordcount: scales linearly; reduce would not be a bottleneck
    2. SparkHdfsLR (LogisticRegression)
       - cached (data is small): very fast for the 2nd, 3rd, ... passes
       - non-cached (data is big): same as Hadoop
    3. GroupByTest (large shuffle process): also linear
    4. POC of a certain project: memory management is the key!
       To make use of the cache: rich data format, simple task -> simple data format, complicated task
  • 16. Treasure Data on the YARN
    YARN (Yet Another Resource Negotiator)
    - job tracker -> resource manager, application master, job history server
    - task tracker -> node manager
    http://www.slideshare.net/ryukobayashi/treasure-data-on-the-yarn-hadoop-conference-japan-2014
  • 17. Treasure Data on the YARN
    Many configuration changes are required for MRv1 -> YARN
    -> copy the HDFS directory of the CDH VM or HDP VM
    -> use Ambari or Cloudera Manager
    -> use the hdp-configuration-utils.py script (http://goo.gl/L2hxyq)
    Don't use Apache Hadoop 2.2.0, 2.3.0, or HDP 2.0 (2.2.0 based): there is a bug (deadlock) in the scheduler.
    These are OK:
    - Apache Hadoop 2.4.1
    - CDH 5.0.2 (2.3.0 based, with patch)
    - HDP 2.1 (2.4.0 based)
  • 18. Practical Machine Learning: Innovations in Recommendation Using Mahout and Solr
    h: a user's behavior
    Ah: users who did h
    user-oriented recommendation (*)
    - cannot be pre-processed -> inefficient and slow
    item-oriented recommendation (*)
    - can be pre-processed by a nightly offline batch -> a new user’s h can be processed in near real time
    http://www.slideshare.net/MapR_Japan/mahoutsolr-hadoop-conference-japan-2014
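The split between offline pre-processing and near-real-time scoring in item-oriented recommendation can be sketched as follows. This is a simplified co-occurrence counter in plain Python, not the Mahout and Solr pipeline from the talk; the function names and the sample histories are made up for illustration.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(histories):
    """Offline batch step: count how often two items appear in the same user's history."""
    co = {}
    for items in histories:
        for first, second in permutations(set(items), 2):
            co.setdefault(first, Counter())[second] += 1
    return co


def recommend(co, h, top_n=3):
    """Near-real-time step: score unseen items against a user's behavior h."""
    scores = Counter()
    for item in h:
        for other, count in co.get(item, {}).items():
            if other not in h:
                scores[other] += count
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical histories, one list of items per user.
histories = [["a", "b"], ["a", "b", "c"], ["b", "c"], ["a", "d"]]
co = cooccurrence(histories)       # can run as a nightly batch
print(recommend(co, ["a", "c"]))   # cheap lookup for a new user's h
```

The expensive part (`cooccurrence`) depends only on past data, which is why it can be moved to a batch job, while `recommend` is a handful of dictionary lookups per request.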
  • 19. Practical Machine Learning: Innovations in Recommendation Using Mahout and Solr
    Multi-modal recommendation (or cross recommendation)
    e.g. movie recommendation
      A: search query
      B: watched video
      -> AtA: query recommendation
      -> BtB: video recommendation
      -> BtA: video-by-query recommendation
    query: “Paco de Lucia” (Spanish guitarist)
    normal result: “hombres de paco” (Spanish TV)
    BtA: Spanish classical guitar Flamenco guitar riff by Van Halen
    Dithering -> vary the recommendation results with random noise
    Reference: Practical Machine Learning ebook
    http://www.mapr.com/practical-machine-learning
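Dithering, as mentioned above, reorders a ranked result list with random noise so that lower-ranked items are occasionally shown. One common formulation adds Gaussian noise to the logarithm of the rank and re-sorts; the sketch below assumes that formulation, and the function name, the `epsilon` parameter, and the sample items are hypothetical.

```python
import math
import random


def dither(ranked, epsilon=0.5, rng=None):
    """Re-sort a ranked list by log(rank) + Gaussian noise with standard deviation epsilon."""
    rng = rng or random.Random()
    noisy = [
        (math.log(rank) + rng.gauss(0, epsilon), item)
        for rank, item in enumerate(ranked, start=1)
    ]
    noisy.sort(key=lambda pair: pair[0])
    return [item for _, item in noisy]


results = ["video1", "video2", "video3", "video4", "video5"]
print(dither(results, epsilon=0.5, rng=random.Random(42)))
```

Because the noise is added on a log scale, the top few positions are shuffled only slightly while items further down the list move more freely. With `epsilon=0` the original order is returned unchanged.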
  • 21. My impression
    Last year: Hybrid or Mixed architecture
    Batch Indexing and Near Real Time, keeping things fast
    http://www.slideshare.net/lucenerevolution/batch-indexing-near-real-time-keeping-things-fast
  • 22. My impression
    This year: Lambda Architecture
    http://lambda-architecture.net/
  • 23. My impression
    Stream processing can’t be replayed or recovered
    -> Hybrid processing for fault tolerance
       Stream processing: executes queries in normal operation
       Batch processing: executes recovery queries
    Batch processing has high latency, and stream processing might not be accurate
    -> Hybrid processing for latency reduction & accuracy
       Stream processing: prompt reports
       Batch processing: fixed reports
    To keep good QCD (quality, cost, delivery), the same query is used for both stream & batch processing
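The hybrid idea above, with one query shared by a fast stream path and a corrective batch path, can be sketched in a few lines. This is a toy model under stated assumptions: the class and function names are hypothetical, events are plain dictionaries, and the query is a simple count whose partial results can be merged by addition.

```python
from collections import Counter


def count_by_key(events):
    """The single query, shared by the stream path and the batch path."""
    return Counter(event["key"] for event in events)


class HybridReport:
    """Serve prompt stream results until a batch run supplies the fixed result."""

    def __init__(self, query):
        self.query = query
        self.prompt = Counter()  # stream path: low latency, may miss events
        self.fixed = None        # batch path: high latency, complete

    def on_stream(self, micro_batch):
        self.prompt.update(self.query(micro_batch))

    def run_batch(self, all_events):
        self.fixed = self.query(all_events)

    def report(self):
        return self.fixed if self.fixed is not None else self.prompt


events = [{"key": "a"}, {"key": "b"}, {"key": "a"}]
report = HybridReport(count_by_key)
report.on_stream(events[:2])  # suppose the third event was lost on the stream path
print(report.report())        # prompt report, possibly inaccurate
report.run_batch(events)      # batch recomputes from the full data
print(report.report())        # fixed report
```

Keeping `count_by_key` as the only place where the logic lives is the point of the last line of the slide: the two paths cannot drift apart if they run the same query.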