2. CTO, CODER, ROCK CLIMBER
• current:
• Chief Technology Officer at Bidlab
• previous:
• IT Director at InternetowyKantor.pl SA
• Software Architect / Project Manager at Wolters Kluwer Polska
• find out more (if you care):
• linkedin.com/in/bartoszbogacki
6. HISTORY
• 2013-06-19 Project enters Apache incubation
• 2014-02-19 Project established as an Apache Top-Level Project.
• 2014-05-30 Spark 1.0.0 released
7. • "Apache Spark is a (lightning-) fast and general-purpose cluster computing system"
• Engine compatible with Apache Hadoop
• Up to 100x faster than Hadoop MapReduce
• Less code to write, more flexible
• Active community (117 developers contributed to release 1.0.0)
8. KEY CONCEPTS
• Runs standalone or on YARN / Mesos resource managers
• HDFS / S3 support built-in
• RDD - Resilient Distributed Dataset
• Transformations & Actions
• Written in Scala, API for Java / Scala / Python
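A minimal sketch of the transformations-vs-actions distinction, using the Java API from the examples above (class and variable names are illustrative, not from the talk): transformations such as map() and filter() are lazy and only describe the computation; an action such as count() actually runs it.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TransformationsVsActions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Demo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformations are lazy: nothing is computed yet.
        JavaRDD<Integer> doubled = numbers.map(x -> x * 2);
        JavaRDD<Integer> big = doubled.filter(x -> x > 4);

        // Actions trigger the actual computation.
        long howMany = big.count(); // 3 (the elements 6, 8, 10)
        System.out.println(howMany);

        sc.stop();
    }
}
```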
28. BROADCAST VARIABLES
• "allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks"
Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});

broadcastVar.value();
// returns [1, 2, 3]
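A sketch of the typical use case, assuming a hypothetical country-code lookup table (the table and its contents are made up for illustration): the broadcast ships the table once per machine, and every task reads it through value().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastLookup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BroadcastLookup").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Small read-only lookup table, shipped once per executor
        // instead of once per task.
        Map<String, String> countries = new HashMap<>();
        countries.put("PL", "Poland");
        countries.put("DE", "Germany");
        Broadcast<Map<String, String>> lookup = sc.broadcast(countries);

        JavaRDD<String> codes = sc.parallelize(Arrays.asList("PL", "DE", "PL"));
        JavaRDD<String> names = codes.map(code -> lookup.value().get(code));

        System.out.println(names.collect()); // [Poland, Germany, Poland]
        sc.stop();
    }
}
```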
29. ACCUMULATORS
• variables that are only “added” to through an associative operation (add())
• only the driver program can read the accumulator’s value
Accumulator<Integer> accum = sc.accumulator(0);

sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x -> accum.add(x));

accum.value();
// returns 10
30. SERIALIZATION
• All objects referenced in your closures (the functions passed to RDD operations) have to be serializable
• Otherwise:
org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException
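A minimal sketch of the usual fix: make the domain class implement java.io.Serializable. The class name BidRequest is borrowed from the Kryo slide below; its fields here are invented for illustration.

```java
import java.io.Serializable;

// Implementing Serializable (ideally with an explicit serialVersionUID)
// lets Spark ship instances of this class to worker nodes.
public class BidRequest implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String userId;
    private final double price;

    public BidRequest(String userId, double price) {
        this.userId = userId;
        this.price = price;
    }

    public String getUserId() { return userId; }
    public double getPrice() { return price; }
}
```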
31. USE KRYO SERIALIZER
public class MyRegistrator implements KryoRegistrator {
@Override
public void registerClasses(Kryo kryo) {
kryo.register(BidRequest.class);
kryo.register(NotifyRequest.class);
kryo.register(Event.class);
}
}
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConfig.set("spark.kryo.registrator", "pl.instream.dsp.offline.MyRegistrator");
sparkConfig.set("spark.kryoserializer.buffer.mb", "10");
34. PARTITIONS
• RDD is partitioned
• You may (and probably should) control the number and size of partitions with the coalesce() / repartition() methods
• By default 1 input file = 1 partition
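A sketch of reducing the partition count, under assumed inputs (the HDFS path and the target of 8 partitions are invented): coalesce() shrinks the partition count without a full shuffle, while repartition() can also grow it but always shuffles.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionControl {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Partitions").setMaster("local[4]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Many small input files would produce many tiny partitions.
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/2014-05-30/*");
        System.out.println(lines.partitions().size());

        // Compact them before further processing; the result has
        // at most 8 partitions.
        JavaRDD<String> compacted = lines.coalesce(8);
        System.out.println(compacted.partitions().size());

        sc.stop();
    }
}
```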
41. DSTREAMS
• continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream
• represented by a continuous sequence of RDDs
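A minimal streaming sketch tying this to the initialization on the next slide (the socket source on localhost:9999 and the 1-second batch interval are assumptions for illustration): each transformation on a DStream is applied to every RDD in its sequence.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamDemo {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("DStreamDemo").setMaster("local[2]");
        // Each batch covers 1 second of input.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));

        // Input DStream: lines arriving on a TCP socket.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Transforming a DStream transforms every RDD in the sequence.
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
        words.print(); // dump the first elements of each batch

        jssc.start();            // nothing runs until the context is started
        jssc.awaitTermination(); // block until stopped or failed
    }
}
```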
42. INITIALIZING
• SparkConf conf = new SparkConf().setAppName("Real-Time Analytics").setMaster("local");
• JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(TIME_IN_MILLIS));