6. Dr. Frank Munz
• Founded munz & more in 2007
• 17 years Oracle Middleware, Cloud, and Distributed Computing
• Consulting and High-End Training
• Wrote two Oracle WLS books and one Cloud book
23. Revisited: MapReduce I/O
munz & more #23
Source: "Hadoop Application Architectures" (book)
24. Spark
• Can be orders of magnitude faster than MapReduce (M/R)
• Higher-level Scala, Java, or Python API
• Runs standalone, on Hadoop, or on Mesos
• Principle: Run an operation on all data
-> "Spark is the new MapReduce"
• See also: Apache Storm, etc.
• Uses RDDs, DataFrames, or Datasets
https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
25. RDDs
Resilient Distributed Datasets
Where do they come from?
Collection of data grouped into named columns.
Supports text, JSON, Apache Parquet, sequence.
• Read in: HDFS, local FS, S3, HBase
• Parallelize: existing collection
• Transform: other RDD
-> RDDs are immutable
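
The three origins of an RDD and its immutability can be illustrated with a toy model. The sketch below is plain single-process Python, not the Spark API: the class `MiniRDD` and its method `from_text` are hypothetical names invented for illustration, and nothing here is distributed or fault tolerant.

```python
from typing import Callable, Iterable, Tuple


class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, items: Iterable):
        # A tuple cannot be modified in place, which mirrors RDD immutability.
        self._items: Tuple = tuple(items)

    @classmethod
    def parallelize(cls, collection: Iterable) -> "MiniRDD":
        """Create a dataset from an existing in-memory collection."""
        return cls(collection)

    @classmethod
    def from_text(cls, text: str) -> "MiniRDD":
        """Create a dataset with one element per line, as when reading a text file."""
        return cls(text.splitlines())

    def map(self, func: Callable) -> "MiniRDD":
        """Derive a new dataset; the source dataset is left untouched."""
        return MiniRDD(func(x) for x in self._items)

    def collect(self) -> list:
        """Return all elements as a regular Python list."""
        return list(self._items)


numbers = MiniRDD.parallelize([1, 2, 3])          # parallelize a collection
doubled = numbers.map(lambda x: x * 2)            # transform another dataset
lines = MiniRDD.from_text("first line\nsecond line")  # read in text
```

After the `map` call, `numbers` still holds its original elements; `doubled` is a separate dataset derived from it.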
27. Transformations
• map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
• flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
• groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
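
To make the semantics of these four transformations concrete, here is a minimal local sketch on plain Python lists. The function names (`rdd_map`, `rdd_flat_map`, `rdd_group_by_key`, `rdd_reduce_by_key`) are hypothetical helpers, not Spark calls; the code only models what each transformation returns, not how Spark partitions or shuffles the data.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def rdd_map(data, func):
    """map: exactly one output element per input element."""
    return [func(x) for x in data]


def rdd_flat_map(data, func):
    """flatMap: func returns a sequence; all sequences are concatenated."""
    return list(chain.from_iterable(func(x) for x in data))


def rdd_group_by_key(pairs):
    """groupByKey: (K, V) pairs become (K, [V, ...]) pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Keys come back in first-seen order here; Spark guarantees no order.
    return list(groups.items())


def rdd_reduce_by_key(pairs, func):
    """reduceByKey: fold the values of each key with func of type (V, V) => V."""
    return [(key, reduce(func, values)) for key, values in rdd_group_by_key(pairs)]


pairs = [("a", 1), ("b", 2), ("a", 3)]
grouped = rdd_group_by_key(pairs)
summed = rdd_reduce_by_key(pairs, lambda x, y: x + y)
```

In Spark itself, `reduceByKey` combines values within each partition before shuffling, so it usually moves less data than `groupByKey` followed by a reduction; this sketch does not model that difference.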
32. Word Count and Histogram
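# t is assumed to be an RDD of text lines
res = (
    t.flatMap(lambda line: line.split(" "))
     .map(lambda word: (word, 1))
     .reduceByKey(lambda a, b: a + b)
)
# Top 5 words by count (the negated count sorts descending)
res.takeOrdered(5, key=lambda x: -x[1])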
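
The same pipeline can be traced without a cluster. The following sketch reproduces each step with the Python standard library only; the sample `lines` are made up, and `heapq.nsmallest` with a negated count stands in for `takeOrdered`.

```python
from collections import Counter
import heapq

lines = ["to be or not to be", "to do is to be"]  # made-up sample input

# flatMap step: split every line into words and flatten the result
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey step: count the occurrences of each word
counts = Counter(words)

# takeOrdered step: the 2 smallest items by negated count, i.e. the 2 most frequent
top = heapq.nsmallest(2, counts.items(), key=lambda x: -x[1])

# A simple text histogram: one '#' per occurrence
for word, n in top:
    print(f"{word:<5}{'#' * n}")
```

This is only a way to check the logic on small inputs; the Spark version distributes the same steps across partitions.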
45. Event Hub REST Interface
https://129.151.91.31:1080/restproxy/topics/a12345orderTopic
Service = Topic
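
As a rough illustration of how a client might address a topic through this URL, the sketch below builds (but does not send) an HTTP POST request. The topic URL is the one from the slide. The request envelope and content type follow the Confluent Kafka REST Proxy conventions and are assumptions here, as is the sample payload; the helper `build_produce_request` is a hypothetical name, and authentication is not shown.

```python
import json
import urllib.request

# URL taken from the slide; the topic name is the last path segment.
TOPIC_URL = "https://129.151.91.31:1080/restproxy/topics/a12345orderTopic"


def build_produce_request(url, value):
    """Build a POST request that would publish one JSON record to the topic.

    Assumption: the service accepts the Kafka REST Proxy style envelope
    {"records": [{"value": ...}]} with the content type used below.
    """
    body = json.dumps({"records": [{"value": value}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    )


# Hypothetical payload, for illustration only.
req = build_produce_request(TOPIC_URL, {"orderId": 1})

# Sending is left out on purpose: urllib.request.urlopen(req) would perform
# the call, and a real service would also require credentials.
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before anything reaches the service.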
46. Interesting to Know
• Event Hub topics are prefixed with the ID domain
• Topics with the ID domain prefix can be created with the Kafka CLI
• Topics without the ID domain prefix are not shown in the OPC console
48. TL;DR #bigData #openSource #OPC
• Open source: entry point to the Oracle Big Data world
• Low(er) setup times
• Check for resource usage & limits in Big Data OPC
• BDCS-CE: managed Hadoop, Hive, Spark; Event Hub: Kafka
• Attend a hands-on workshop!
• Next level: Oracle Big Data tools
@EdelweissK
@FrankMunz