Apache Spark presentation showing how Spark works internally and how it handles distributed data.
A comparison with Apache Hadoop is made in order to show the advantages that Apache Spark offers.
3. Hadoop issues
- Difficult to install and maintain
- Slow due to replication and disk-based storage
- Requires integrating different tools (machine learning, stream processing)
- "Spending more time learning the data-processing tool than processing data"
7. Which one should I choose?
Standalone - simple setups / local testing / REPL
YARN / Mesos - run Spark alongside other applications / use the richer
resource-scheduling capabilities
YARN - ResourceManager / NodeManager
Mesos - Mesos master / Mesos agent
YARN will likely be preinstalled in many Hadoop distributions.
In all cases - it is best to run Spark on the same nodes as HDFS for fast access to
storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most
Hadoop distributions already install YARN and HDFS together.
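The choice above maps directly to the --master URL passed to spark-submit. A sketch, with host names, ports, and the application jar as placeholders:

```shell
# Standalone cluster manager
spark-submit --master spark://master-host:7077 app.jar
# YARN (cluster location is read from the Hadoop configuration)
spark-submit --master yarn app.jar
# Mesos
spark-submit --master mesos://mesos-master:5050 app.jar
# Local "cluster" with 4 threads, for testing / REPL work
spark-submit --master "local[4]" app.jar
```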
8. RDD - Resilient Distributed Dataset
[Figure: a cluster holding a collection of log lines (Error / warn / info messages with timestamps: msg1, msg2, msg3, msg5, msg8, msg9) split across an RDD of 4 partitions, distributed over 3 workers]
RDD / 4 partitions (use 2-4 partitions per CPU in your cluster)
Worker | Worker | Worker
9. RDD - Resilient Distributed Dataset
Parallelized Collections
Created by calling JavaSparkContext’s parallelize method on an existing
collection; the resulting distributed dataset (distData) can be operated on in parallel
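A plain-Python sketch (not the Spark API itself) of what parallelize conceptually does: split a driver-side collection into partitions that can then be processed independently. Spark's real slicing strategy differs (contiguous ranges, lazy distribution); round-robin is used here only for simplicity.

```python
# Sketch of parallelize(): split a driver-side collection
# into num_slices partitions (round-robin, for illustration).
def parallelize(data, num_slices):
    partitions = [[] for _ in range(num_slices)]
    for i, item in enumerate(data):
        partitions[i % num_slices].append(item)  # assign item to a partition
    return partitions

dist_data = parallelize(list(range(10)), 4)

# Each partition can now be operated on independently, e.g. summed "in parallel":
partial_sums = [sum(p) for p in dist_data]
total = sum(partial_sums)
```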
18. Lifecycle of a Spark program
1) Create some input RDD from external data
2) Lazily transform them (filter(), map())
3) Ask Spark to cache() any RDDs that need to be reused
4) Launch actions (count(), reduce()) to kick off parallel computation
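The four steps above can be sketched in plain Python using lazy generators, which mimic Spark's behavior: transformations only build a pipeline, and nothing runs until an action asks for a result. This is a concept sketch, not Spark code.

```python
# 1) Create some input "RDD" from external data
lines = ["Error ts msg1", "info ts msg2", "Error ts msg3"]

# 2) Lazily transform it: generators, like Spark transformations,
#    compute nothing until a result is demanded.
errors = (l for l in lines if l.startswith("Error"))   # like filter()
messages = (l.split()[-1] for l in errors)             # like map()

# 3) "cache" the result for reuse (generators are one-shot, like
#    recomputing an uncached RDD would be expensive)
cached = list(messages)

# 4) Actions kick off the actual computation
count = len(cached)  # like count()
```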
22. Spark SQL
DataFrames can be created from different data sources such as:
- Existing RDDs
- Structured data files
- JSON datasets
- Hive tables
- External databases
Entry points: SQLContext / HiveContext (supports HiveQL)
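A plain-Python sketch of the JSON-dataset case: each line is a JSON record, rows are built from it, and a SQL-style query becomes a filter plus a projection. (In Spark itself this would be SQLContext reading the JSON and running the SQL; the data here is invented for illustration.)

```python
import json

# A tiny "JSON dataset": one JSON record per line
json_lines = [
    '{"name": "Ana", "age": 34}',
    '{"name": "Bob", "age": 19}',
]
rows = [json.loads(l) for l in json_lines]  # schema is implicit in each record

# The query "SELECT name FROM people WHERE age > 21" becomes:
names = [r["name"] for r in rows if r["age"] > 21]
```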
23. Spark streaming
Streaming data: user activity on websites, monitoring data, server logs, and other
event data
Incoming data is split into micro-batches at a pre-defined interval
(N seconds), and each batch is treated as an RDD
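The micro-batch model can be sketched in plain Python: events arriving over time are grouped into fixed N-second intervals, and each group is then processed like an ordinary batch. The event data below is invented for illustration.

```python
# Group timestamped events into micro-batches of BATCH_INTERVAL seconds,
# mimicking how Spark Streaming turns a stream into a sequence of RDDs.
BATCH_INTERVAL = 2  # seconds

events = [  # (timestamp_in_seconds, message)
    (0.5, "login"), (1.2, "click"), (2.1, "click"), (3.9, "logout"),
]

batches = {}
for ts, msg in events:
    batch_id = int(ts // BATCH_INTERVAL)  # which interval this event falls in
    batches.setdefault(batch_id, []).append(msg)

# Each value in `batches` is one micro-batch, processed independently:
counts = {b: len(msgs) for b, msgs in batches.items()}
```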
24. Other Spark libraries
- MLlib (machine learning)
- Spark Streaming (Streaming)
- GraphX (distributed graph processing)
- Third party projects
(https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects)
26. Security
Authentication via a shared secret
- YARN: set spark.authenticate to true / generation and distribution of the
shared secret are handled automatically
- Others: set spark.authenticate.secret on every node
Web UI - Java servlet filters (spark.ui.filters)
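These settings live in spark-defaults.conf. A sketch for a non-YARN deployment; the secret value and the filter class name are placeholders, not real values:

```properties
# Enable shared-secret authentication between Spark processes
spark.authenticate          true
# On non-YARN deployments the same secret must be set on every node
spark.authenticate.secret   <your-shared-secret>
# Protect the Web UI with a servlet filter (placeholder class name)
spark.ui.filters            com.example.MyAuthFilter
```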