2.
about the conference
- Held annually by Hadoop User Group Japan since 2009
(so this is the 5th)
- More than 1200 attendees
- Sponsored by Recruit Technologies
and Cloudera, SAS Institute Japan, Treasure Data,
IBM Japan, MapR Technologies
4.
Future of Data by Doug Cutting
Chief Architect of Cloudera
Creator of Lucene, Nutch, and Hadoop
“Hardware costs will keep decreasing, and
the value of data will keep increasing, as before.”
“Hadoop’s functionality will keep growing,
so it might become possible to execute
transactional tasks on Hadoop.”
Q: I’d be glad if the Apache version becomes the standard...
A: We’ll keep reflecting good points into the Apache version.
Q: I have big expectations for Hadoop near-real-time processing...
A: We’re working on that, and if it matures we won’t need to use Storm or ...
Q: Are you still working on Lucene?
A: Sorry, I’m not. I’m not sure about the current Lucene...
5.
Future of Spark by Patrick Wendell
A main developer of Spark, working at
Databricks.
Databricks Cloud (PaaS Spark cluster)
7.
BigQuery and the world after MapReduce
Map & Reduce -> BigQuery (Dremel)
GFS -> Colossus
Storage: $0.026 per GB per month; Queries: $5 per TB
https://speakerdeck.com/kazunori279/bigquery-and-the-world-after-mapreduce
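As a quick worked example of what this pricing means (my own arithmetic, not a figure from the talk):

```python
# Rough BigQuery cost arithmetic based on the prices quoted above.
STORAGE_PER_GB_MONTH = 0.026  # USD per GB per month
QUERY_PER_TB = 5.0            # USD per TB scanned

def monthly_cost(stored_gb, queried_tb):
    """Estimated monthly bill: storage plus on-demand query cost."""
    return stored_gb * STORAGE_PER_GB_MONTH + queried_tb * QUERY_PER_TB

# Storing 1 TB (1000 GB) and scanning it 10 times in a month:
print(monthly_cost(1000, 10))  # 0.026*1000 + 5*10 = 76.0
```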
8.
BigQuery and the world after MapReduce
Small JOIN
executed with Broadcast JOIN
One table should be < 8MB
One table is sent to every shard
Big JOIN
executed with shuffling
Both tables can be > 8MB
Shuffler doesn’t sort,
just hash partitioning
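The two JOIN strategies can be sketched in plain Python (a toy model, not BigQuery's implementation): a small JOIN broadcasts the whole small table to every shard, while a big JOIN hash-partitions both sides, without sorting, so matching keys meet on the same shard.

```python
from collections import defaultdict

def broadcast_join(shards, small_table):
    """Small JOIN: the small table is copied whole to every shard."""
    lookup = dict(small_table)              # sent to each shard
    return [(k, v, lookup[k])
            for shard in shards
            for k, v in shard if k in lookup]

def shuffle_join(left, right, num_shards=4):
    """Big JOIN: hash-partition both tables (no sort), then join per shard."""
    parts = defaultdict(lambda: ([], []))
    for k, v in left:
        parts[hash(k) % num_shards][0].append((k, v))
    for k, v in right:
        parts[hash(k) % num_shards][1].append((k, v))
    out = []
    for l_rows, r_rows in parts.values():
        lookup = defaultdict(list)
        for k, v in r_rows:
            lookup[k].append(v)
        out.extend((k, v, w) for k, v in l_rows for w in lookup[k])
    return out
```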
9.
BigQuery and the world after MapReduce
BigQuery + Hadoop, BigQuery Streaming, UDF with JavaScript, etc...
10.
Batch processing and Stream processing
by SQL
Batch processing
- Hadoop/Hive
- hours to weeks
- Highest throughput
- Largest latency
Short Batch processing
- Presto, Impala, Drill
- seconds to hours
- Normal throughput
- Small latency
Stream processing
- Storm, Kafka, Esper,
Norikra, Fluentd,...
- seconds to hours
- Normal throughput
- Smallest latency
- After query registration,
it runs repeatedly
Norikra
- Internally uses Esper,
but with NO SCHEMA
- Not distributed
- 10K events/sec
on 2 CPUs (8 cores)
http://www.slideshare.net/tagomoris/hcj2014-sql
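The "register once, run repeatedly" model of stream processing can be illustrated with a toy standing query (plain Python; Norikra itself takes SQL-like Esper queries, so this is only an analogy):

```python
from collections import deque

class StandingQuery:
    """Toy continuous query: counts events per key over the last N events."""
    def __init__(self, window=100):
        self.window = deque(maxlen=window)   # sliding event window
    def push(self, event):
        self.window.append(event)
    def result(self):
        counts = {}
        for e in self.window:
            counts[e] = counts.get(e, 0) + 1
        return counts

q = StandingQuery(window=3)          # register the query once
for ev in ["GET", "GET", "POST", "GET"]:
    q.push(ev)                       # evaluated repeatedly as events arrive
print(q.result())                    # last 3 events: {'GET': 2, 'POST': 1}
```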
11.
Deeper Understanding of Spark’s Internals
Spark Execution Model
1. Create DAG of RDDs to represent computation
2. Create logical execution plan for DAG
3. Schedule and execute individual tasks
RDD: Resilient Distributed Dataset
DAG: Directed Acyclic Graph
task: data + computation
execute all tasks within a stage before moving on to the next stage
http://www.slideshare.net/hadoopconf/japanese-spark-internalssummit20143
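The three steps above can be mimicked with a toy stage scheduler (a sketch, not Spark's actual scheduler): all tasks in one stage complete before the next stage starts.

```python
def run_job(stages):
    """Run a list of stages, where each stage is a list of task callables.
    Barrier between stages: every task in a stage finishes before the
    next stage starts, as in Spark's stage-by-stage execution."""
    results = []
    for stage in stages:
        results.append([task() for task in stage])  # whole stage first
    return results

# Two stages, e.g. map tasks, then reduce tasks after a shuffle boundary:
print(run_job([[lambda: "map-0", lambda: "map-1"], [lambda: "reduce-0"]]))
# [['map-0', 'map-1'], ['reduce-0']]
```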
12.
Deeper Understanding of Spark’s Internals
Common issue checklist
- Ensure enough partitions for concurrency
(at least 2x number of cores in cluster, and at least 100ms for each task)
(commonly between 100 and 10,000)
- Minimize memory consumption (sorting, large key in group-by)
- Minimize amount of data shuffled
- Know the standard library
Memory Problems
Symptoms:
- Inexplicably bad performance
- Inexplicable executor/machine failures
Diagnosis:
- Set spark.executor.extraJavaOptions to
-XX:+PrintGCDetails
-XX:+HeapDumpOnOutOfMemoryError
Resolution:
- Increase spark.executor.memory
- Increase number of partitions
- Re-evaluate program structure
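The partition-count rule of thumb from the checklist can be written down as a small helper (my own sketch of the heuristic, not anything from Spark itself):

```python
def suggested_partitions(total_cores, floor=100, ceiling=10_000):
    """Heuristic from the checklist: at least 2x the cluster's cores,
    commonly kept between ~100 and ~10,000 partitions."""
    return min(max(2 * total_cores, floor), ceiling)

print(suggested_partitions(16))    # small cluster -> floor of 100
print(suggested_partitions(400))   # 2 * 400 = 800
print(suggested_partitions(8000))  # capped at 10,000
```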
13.
Spark on large Hadoop cluster
and evaluation from the view point of
enterprise Hadoop user and developer
http://www.slideshare.net/hadoopxnttdata/apache-spark-nttdatahcj2014
15.
Spark on large Hadoop cluster
and evaluation from the view point of
enterprise Hadoop user and developer
Please check the evaluation results & summary in the slides
4k core
10TB + RAM
10G network
spark 1.0.0
HDFS (CDH 5.0)
1. wordcount
linear; reduce wouldn’t be the bottleneck
2. SparkHdfsLR (LogisticRegression)
cached (data is small): very fast for the 2nd, 3rd, ... runs
non-cached (data is big): about the same as Hadoop
3. GroupByTest (Large shuffle process)
also linear
4. POC of a certain project
the key is memory management!
to utilize the cache, move from
rich data format + simple tasks -> simple data format + complicated tasks
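The cached vs non-cached observation for iterative jobs such as logistic regression can be sketched with a toy caching model (not Spark's API; names here are illustrative):

```python
class Dataset:
    """Toy model of RDD caching: loading is expensive, cache() keeps
    the data in memory so later iterations skip the load."""
    def __init__(self, loader):
        self.loader = loader
        self.cached = None
        self.loads = 0              # how many times we hit "disk"
    def cache(self):
        self.loads += 1
        self.cached = self.loader()
        return self
    def rows(self):
        if self.cached is not None:
            return self.cached      # 2nd, 3rd, ... iterations: fast path
        self.loads += 1
        return self.loader()        # non-cached: re-read every iteration

data = Dataset(lambda: list(range(5))).cache()
for _ in range(3):                  # e.g. gradient-descent iterations
    total = sum(data.rows())
print(data.loads)                   # 1 -> loaded once, reused by every pass
```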
16.
Treasure Data on the YARN
YARN (Yet Another Resource Negotiator)
JobTracker ->
ResourceManager
ApplicationMaster
JobHistoryServer
TaskTracker ->
NodeManager
http://www.slideshare.net/ryukobayashi/treasure-data-on-the-yarn-hadoop-conference-japan-2014
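For a sense of what changes, two of the standard properties involved in an MRv1 -> YARN move (generic Hadoop settings, not specific to this talk):

```xml
<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml: NodeManager needs the MapReduce shuffle service -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```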
17.
Treasure Data on the YARN
Many configuration changes are required to move from MRv1 to YARN
-> copy the HDFS config directory from a CDH VM or HDP VM
-> use Ambari or Cloudera Manager
-> use the hdp-configuration-utils.py script (http://goo.gl/L2hxyq)
Don't use Apache Hadoop 2.2.0, 2.3.0, or HDP 2.0 (2.2.0 based):
there is a scheduler bug (deadlock)
These are OK:
Apache Hadoop 2.4.1
CDH 5.0.2 (2.3.0 based, with patches)
HDP 2.1 (2.4.0 based)
18.
Practical Machine Learning - Innovation in
Recommendation Using Mahout and Solr
h: a user's behavior (history)
Ah: users who did h
user-oriented recommendation
(*) cannot be pre-processed
-> not efficient & slow
item-oriented recommendation
(*) can be pre-processed by a
nighttime offline batch
-> a new user’s h can be processed
in near real time
http://www.slideshare.net/MapR_Japan/mahoutsolr-hadoop-conference-japan-2014
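The item-oriented scheme — build item-item cooccurrence offline, then score a new user's history h online — can be sketched as follows (a toy version of the AtA cooccurrence idea, not Mahout's implementation):

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(histories):
    """Offline batch (e.g. nightly): item-item cooccurrence counts (~ AtA)."""
    cooc = defaultdict(lambda: defaultdict(int))
    for h in histories:
        for a, b in combinations(set(h), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc

def recommend(cooc, h, top=3):
    """Online, near real time: score items cooccurring with history h."""
    scores = defaultdict(int)
    for item in h:
        for other, count in cooc[item].items():
            if other not in h:
                scores[other] += count
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [item for item, _ in ranked[:top]]

cooc = build_cooccurrence([["a", "b"], ["a", "b", "c"], ["b", "c"]])
print(recommend(cooc, ["a"]))  # b cooccurs with a twice, c once -> ['b', 'c']
```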
19.
Practical Machine Learning - Innovation in
Recommendation Using Mahout and Solr
Multi-Modal Recommendation (or cross recommendation)
e.g.)
movie recommendation
A: search query
B: watched video
-> AtA: query recommendation
-> BtB: video recommendation
-> BtA: video by query recommendation
query: “Paco de Lucia” (Spanish guitarist)
normal result: “hombres de paco” (Spanish TV)
BtA: Spanish classical guitar
Flamenco guitar riff by Van Halen
Dithering
-> perturb recommendation results with random noise
Reference
Practical Machine Learning ebook
http://www.mapr.com/practical-machine-learning
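Dithering reorders a result list by adding random noise to a function of the original rank, so lower-ranked items occasionally surface (and generate fresh training data). One common formulation is log(rank) plus Gaussian noise; the parameters below are illustrative, not from the talk:

```python
import math
import random

def dither(ranked_items, epsilon=0.5, seed=None):
    """Re-rank results by score = log(original rank) + Gaussian noise.
    Small epsilon keeps the top mostly stable; larger epsilon mixes
    deeper results in, which users would otherwise never see."""
    rng = random.Random(seed)
    noisy = [(math.log(rank) + rng.gauss(0, epsilon), item)
             for rank, item in enumerate(ranked_items, start=1)]
    return [item for _, item in sorted(noisy)]

results = ["r1", "r2", "r3", "r4", "r5"]
print(dither(results, epsilon=0.3, seed=42))
```

With epsilon = 0 the original order is preserved; with a fresh seed on each page view, each refresh shows a slightly different ordering.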
21.
my impression
Last year: Hybrid or Mixed architecture
Batch Indexing and Near Real Time, keeping things fast
http://www.slideshare.net/lucenerevolution/batch-indexing-near-real-time-keeping-things-fast
23.
my impression
Stream processing can’t be replayed or recovered
-> hybrid processing for fault tolerance
Stream processing: executes queries normally
Batch processing: executes recovery queries
Batch processing has large latency, and
stream processing might not be accurate
-> hybrid processing for latency reduction & accuracy
Stream processing: prompt reports
Batch processing: fixed reports
To keep QCD (quality, cost, delivery) good,
the same query is used for both stream & batch processing
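The "same query, two paths" design can be sketched as a toy lambda-style pipeline (plain Python standing in for the SQL that systems like Norikra and Hive would actually run):

```python
def count_by_key(events):
    """The single 'query', shared by both paths (keeps QCD consistent)."""
    out = {}
    for k in events:
        out[k] = out.get(k, 0) + 1
    return out

class Hybrid:
    """Toy hybrid pipeline: one query, a fast path and an accurate path."""
    def __init__(self):
        self.log = []        # durable log: input of the batch path
        self.recent = []     # in-flight events: input of the stream path
    def ingest(self, event):
        self.log.append(event)
        self.recent.append(event)
    def prompt_report(self):
        return count_by_key(self.recent)   # fast, possibly inaccurate
    def fixed_report(self):
        return count_by_key(self.log)      # slow, authoritative / recovery

h = Hybrid()
for e in ["a", "b", "a"]:
    h.ingest(e)
h.recent.pop()                 # simulate a lost in-flight event
print(h.prompt_report())       # {'a': 1, 'b': 1}  <- stream view drifted
print(h.fixed_report())        # {'a': 2, 'b': 1}  <- batch path recovers it
```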