With the advent of reliable streaming technologies, real-time data pipelines have become a crucial component of any robust data initiative. Compared to a traditional Hadoop-centric data hub, these real-time stacks provide high levels of system availability and data integrity, coupled with very low-latency queries, without the overhead of inflexible schemas or the lag of batch analysis.
Alex Silva demonstrates how to use Kafka, Spark Streaming, Akka, and Hadoop to orchestrate a real-time stack and explains how data flows through the system. This real-time data platform combines open source technologies and home-grown services into a full end-to-end solution, from flexible data-ingestion protocols to fast data analysis and queries.
Topics include:
External message providers, which connect to the platform through a data-ingestion service modeled as a robust actor system built with Akka and Scala (the protocol messages are sketched after this list)
Routing data to different backend systems, including Kafka and Druid
Spark Streaming, which is used to perform real-time complex analytical and scientific processing on the data
Exporting data into Hadoop for future processing
Querying and visualization
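The ingestion code later in the deck pattern-matches on a small set of protocol messages (Initialize, Validate, Ingest, Publish, Shutdown, Heartbeat) and replies (Validated, HandlerCompleted). The definitions below are a minimal sketch of what those messages might look like; the actual payload types are not shown in the talk and are assumptions here.

// Sketch of the ingestion protocol; `request` stands in for whatever envelope
// the ingestion service wraps an inbound message in (an assumption).
object IngestionProtocol {
  case object Initialize
  case object Shutdown
  case object Heartbeat

  case class Publish(request: Any)
  case class Validate(request: Any)
  case class Ingest(request: Any)

  // replies sent back to the ingestion coordinator
  case object Validated
  case object HandlerCompleted
}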
// Base handler for the ingestion protocol: every case is a (mostly no-op) default,
// so concrete handlers only override the messages they care about.
abstract class BaseMessageHandler extends Actor with ActorConfigSupport with ActorLogging
    with IngestionFlow with ProducerSupport with MessageHandler {

  ingest {
    case Initialize =>
      // nothing required by default

    case Publish(request) =>
      log.info(s"Publish message was not handled by $self. Will not join.")

    case Validate(request) =>
      sender ! Validated

    case Ingest(request) =>
      log.warning(s"Ingest message was not handled by $self.")
      sender ! HandlerCompleted

    case Shutdown =>
      // nothing required by default

    case Heartbeat =>
      Health.get(self).getChecks
  }
}
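A concrete handler can then extend BaseMessageHandler and supply behavior only for the messages it cares about. The class below is a hypothetical example, not from the talk: it assumes the ingest block declared in a subclass overrides (or composes with) the defaults above, and the handler name and logging are illustrative.

class SegmentMessageHandler extends BaseMessageHandler {

  ingest {
    case Validate(request) =>
      // handler-specific validation would run here before acknowledging
      sender ! Validated

    case Ingest(request) =>
      // handler-specific ingestion work goes here, e.g. producing to Kafka
      // through ProducerSupport, before signaling completion
      log.info(s"Ingesting request $request")
      sender ! HandlerCompleted
  }
}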
// Supervision for the Kafka producer: transient send failures trigger a restart,
// while broker-level or unknown failures are escalated to the parent.
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
    case _: ActorInitializationException   => Stop
    case _: FailedToSendMessageException   => Restart
    case _: ProducerClosedException        => Restart
    case _: NoBrokersForPartitionException => Escalate
    case _: KafkaException                 => Escalate
    case _: ConnectException               => Escalate
    case _: Exception                      => Escalate
  }

// Wrap the producer actor in a backoff supervisor so that after a failure it is
// recreated with an exponentially increasing delay (3s up to 30s, 20% jitter).
val kafkaProducerSupervisor = BackoffSupervisor.props(
  Backoff.onFailure(
    kafkaProducerProps,
    childName = actorName[KafkaProducerActor],
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2
  ))
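The supervisor props are then used wherever the producer is created. The parent actor below is a hypothetical sketch (its name and forwarding behavior are assumptions) showing the typical wiring: it instantiates the backoff supervisor once and forwards produce requests to it.

import akka.actor.{Actor, ActorRef, Props}

// Hypothetical parent that owns the supervised producer.
class KafkaProducerParent(kafkaProducerSupervisor: Props) extends Actor {

  private val producer: ActorRef =
    context.actorOf(kafkaProducerSupervisor, name = "kafka-producer-supervisor")

  def receive: Receive = {
    // forward everything to the supervised KafkaProducerActor, preserving the sender;
    // if the producer crashes, BackoffSupervisor recreates it after the backoff delay
    case msg => producer forward msg
  }
}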
class KafkaProducerActor extends Actor with LoggingAdapter with ActorConfigSupport
    with NotificationSupport[KafkaMessage[Any, Any]] {

  import KafkaProducerActor._

  implicit val ec = context.dispatcher

  // Invoked when this actor is restarted after a failure. If the message that caused
  // the crash was a produce request that never got an ack from Kafka, notify observers
  // and re-send it to self after an exponential backoff delay.
  override def preRestart(cause: Throwable, message: Option[Any]) = {
    message match {
      case Some(rp: RetryingProduce) =>
        notifyObservers(KafkaMessageNotDelivered(rp.msg))
        val nextBackOff = rp.backOff.nextBackOff
        val retry = RetryingProduce(rp.topic, rp.msg)
        retry.backOff = nextBackOff
        context.system.scheduler.scheduleOnce(nextBackOff.waitTime, self, retry)

      case Some(produce: Produce) =>
        notifyObservers(KafkaMessageNotDelivered(produce.msg))
        if (produce.msg.retryOnFailure) {
          context.system.scheduler.scheduleOnce(initialDelay, self,
            RetryingProduce(produce.topic, produce.msg))
        }

      case _ => // nothing in flight, nothing to retry
    }
  }
}
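The restart logic above relies on a few shapes that the slides do not define. The sketch below is an assumption inferred from how the fields are used (backOff.nextBackOff, waitTime, msg.retryOnFailure, initialDelay); the real definitions may differ.

import scala.concurrent.duration._

// Assumed message envelope with a retry flag.
case class KafkaMessage[K, V](key: K, payload: V, retryOnFailure: Boolean = true)

case class ExponentialBackOff(waitTime: FiniteDuration) {
  // double the wait on every failed delivery attempt
  def nextBackOff: ExponentialBackOff = copy(waitTime = waitTime * 2)
}

case class Produce(topic: String, msg: KafkaMessage[Any, Any])

case class RetryingProduce(topic: String, msg: KafkaMessage[Any, Any]) {
  // mutable so the scheduled retry carries the next delay, as set in preRestart
  var backOff: ExponentialBackOff = ExponentialBackOff(1.second)
}

object KafkaProducerActor {
  // first retry delay used when a plain Produce fails
  val initialDelay: FiniteDuration = 1.second
}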
59. The Error Kernel Pattern
Error kernel: valuable state stays in the supervising actor
Per-request worker actors
No risky processing in the kernel
Delegation of work to children
Ingestion errors
Timeouts
(A minimal sketch of this per-request delegation follows.)
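The sketch below illustrates the error kernel idea on this slide: the coordinator keeps no risky state and does no processing itself; it spawns a short-lived worker per ingestion request, so errors and timeouts only take down that worker. All names here (IngestionCoordinator, IngestionWorker) are illustrative, not the platform's actual classes.

import akka.actor.{Actor, Props, ReceiveTimeout}
import scala.concurrent.duration._

// The kernel: no processing here, only delegation.
class IngestionCoordinator(workerProps: Props) extends Actor {
  def receive: Receive = {
    case request =>
      // one child per request; a failure affects only that worker
      val worker = context.actorOf(workerProps)
      worker ! request
  }
}

// A short-lived worker that performs the risky ingestion work.
class IngestionWorker extends Actor {
  // guard against stalled requests: stop this worker if nothing arrives in time
  context.setReceiveTimeout(30.seconds)

  def receive: Receive = {
    case ReceiveTimeout => context.stop(self)
    case request        => // perform the ingestion work for this request here
  }
}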
79. Job Manager Endpoint
Configuration
Job repository
Tracking
Lifecycle management
GET /jobs?limit=N - Lists the last N jobs
POST /jobs - Starts a new job; pass 'sync=true' to wait for the result
GET /jobs/<jobId> - Gets the result or status of a job
DELETE /jobs/<jobId> - Kills the job
GET /jobs/<jobId>/config - Gets the job configuration
(A route sketch for this endpoint follows.)
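The talk does not say which HTTP toolkit backs this endpoint, so the snippet below is only a sketch of how the routes above could be expressed with akka-http; the completions are placeholders, not the real handlers.

import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

object JobRoutes {
  val routes: Route =
    pathPrefix("jobs") {
      (pathEnd & get & parameter("limit".as[Int] ? 10)) { limit =>
        complete(s"last $limit jobs")                    // GET /jobs?limit=N
      } ~
      (pathEnd & post & parameter("sync".as[Boolean] ? false)) { sync =>
        complete(if (sync) "job result" else "STARTED")  // POST /jobs?sync=true
      } ~
      (path(Segment / "config") & get) { jobId =>
        complete(s"configuration of $jobId")             // GET /jobs/<jobId>/config
      } ~
      (path(Segment) & get) { jobId =>
        complete(s"status of $jobId")                    // GET /jobs/<jobId>
      } ~
      (path(Segment) & delete) { jobId =>
        complete(s"killed $jobId")                       // DELETE /jobs/<jobId>
      }
    }
}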
80. Creating Spark Jobs
#Ad-hoc jobs through hydra - run-once jobs with transient contexts
curl --data-binary @/etc/local/hydra/video-segment-fx.jar localhost:9091/jars/segment
curl -d "kafka.topic=segment"
'localhost:9091/jobs?appName=segment&classPath=hydra.SegmentJob&sync=false'
{
"status": "STARTED",
"result": {
"jobId": "3156120b-f001-56cf-d22a-b40ebf0a9af1",
"context": "f5ed0ec1-hydra.spark.analytics.segment.SegmentJob"
}
}
81. Persistent Context Jobs
#Required for related jobs
#Create a new context
curl -X POST 'localhost:9091/contexts/video-032116-ctx?num-cpu-cores=10&memory-per-node=512m'
OK
#Verify creation
curl localhost:9091/contexts
["video-032116-ctx"]
#Run job using the context
curl -d "kafka.topic=segment"
'localhost:9091/jobs?
appName=segment&classPath=hydra.SegmentJob&sync=true&context=video-032116-ctx'
{
"result":{
"active-sessions":24476221
}
}
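The result above comes from the job class (hydra.SegmentJob) executed against the shared context. What that class looks like is not shown in the talk; the sketch below assumes a spark-jobserver-style contract (a validate step plus a runJob step against a shared SparkContext), and the trait name, fields, and metric are illustrative only.

import com.typesafe.config.Config
import org.apache.spark.SparkContext

// Assumed job contract; the real Hydra API may differ.
trait HydraSparkJob {
  def validate(sc: SparkContext, config: Config): Either[String, Config]
  def runJob(sc: SparkContext, config: Config): Any
}

object SegmentJob extends HydraSparkJob {

  // reject the request early if the required kafka.topic parameter is missing
  override def validate(sc: SparkContext, config: Config): Either[String, Config] =
    if (config.hasPath("kafka.topic")) Right(config) else Left("kafka.topic is required")

  // compute the metric returned to the caller, e.g. {"active-sessions": N}
  override def runJob(sc: SparkContext, config: Config): Any = {
    val topic = config.getString("kafka.topic")
    // ... load segment events for `topic` and compute the session count ...
    Map("active-sessions" -> 0L)
  }
}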