© Copyright 2014 EMC Corporation. All rights reserved.
Real Time Data Streaming
+
Speakers:
Sumit Gupta, Data Intelligence Engineer, EMC
Kartikeya Putturaya, Data Intelligence Engineer, EMC
ChandraSekarRao Venkata, Data Intelligence Engineer, EMC
Data Engineering at EMC IT
Stack
Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm
Messaging Systems: RabbitMQ, Apache Kafka
Relational Store: Greenplum
A glimpse of what we do
Predictive Maintenance of Exchange Servers - Monitoring over 145 Exchange servers in real time, with an analytics engine running on an 8-node cluster, processing data volumes of ~100 MB every 2 minutes
User Behavior Analytics for Network Threat Detection - Real-time monitoring of EMC's internal networks and user behavior pattern analysis for threats, again on an 8-node cluster, processing a stream of ~150 MB of data at any point in time
Predictive Maintenance of Exchange Servers
User Behavior Analytics for Network Threat Detection
Apache Kafka
Overview
An Apache project initially developed at LinkedIn
Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g. logs and metrics collection
• Written in Scala
• Does not follow the JMS standard and does not use JMS APIs
Features
Persistent messaging
High-throughput
Supports both queue and topic semantics
Uses ZooKeeper to form a cluster of nodes (producer/consumer/broker)
and many more…
http://kafka.apache.org/
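To make "both queue and topic semantics" a little more concrete, here is a minimal sketch in plain Python. It relies on Kafka's consumer-group idea: consumers in the same group share the partitions of a topic (queue-like), while each separate group receives all of them (topic-like). The function name `assign_partitions` and the round-robin strategy are assumptions made for illustration only, not Kafka's actual assignment logic.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of ONE group, round-robin.

    Hypothetical helper for illustration; not part of any Kafka API.
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Queue-like: two consumers in the same group split the partitions.
shared_group = assign_partitions(partitions, ["c1", "c2"])

# Topic-like: a consumer alone in its own group receives every partition.
solo_group = assign_partitions(partitions, ["c3"])
```

Each message in a partition is handled by exactly one consumer within a group, which is why adding consumers to a group spreads the load, while adding a new group creates another independent subscriber.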
How it works
Real-time transfer
The broker does not push messages to consumers; consumers poll (pull) messages from the broker.
Kafka maintains feeds of messages in categories called topics. For each topic, the Kafka cluster maintains a partitioned log.
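As a rough illustration of the partitioned log and the pull model, here is a minimal in-memory sketch in Python. The class and method names (`ToyTopic`, `append`, `read`) are hypothetical and not part of any Kafka API, and choosing the partition from a CRC32 hash of the key is an assumption of this sketch. It only models the idea that each partition is an append-only sequence addressed by offset, and that a consumer asks for data starting at an offset it tracks itself.

```python
import zlib


class ToyTopic:
    """Hypothetical in-memory stand-in for a topic with a partitioned log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Pick a partition from a stable hash of the key (an assumption of
        # this sketch), then append to the end of that partition's log.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[index]
        log.append(value)
        return index, len(log) - 1  # (partition, offset of the new message)

    def read(self, partition, offset, max_messages=10):
        # Consumers pull: they ask for messages starting at a given offset.
        return self.partitions[partition][offset:offset + max_messages]


topic = ToyTopic(num_partitions=3)
p1, o1 = topic.append("server-17", "cpu=40")
p2, o2 = topic.append("server-17", "cpu=55")
batch = topic.read(p1, 0)
```

Because messages with the same key land in the same partition, their relative order is preserved within that partition, and a consumer can re-read from any earlier offset.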
Kafka Installation
Download
http://kafka.apache.org/downloads.html
Untar it
> tar -xzf kafka_<version>.tgz
> cd kafka_<version>
Start Servers
Prerequisite: ZooKeeper must be up and running before the Kafka server starts.
Start the ZooKeeper server
> bin/zookeeper-server-start.sh config/zookeeper.properties
Now start the Kafka server
> bin/kafka-server-start.sh config/server.properties
Create/List Topics
Create a topic
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
List all topics
> bin/kafka-topics.sh --list --zookeeper localhost:2181
Output: test
Producer
Send some messages
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Now type on the console:
This is a message
This is another message
Consumer
Receive some messages
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
Multi-Broker Cluster
Copy configs
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
Change the following settings in the new config files.
config/server-1.properties:
    broker.id=1
    port=9093
    log.dir=/tmp/kafka-logs-1
config/server-2.properties:
    broker.id=2
    port=9094
    log.dir=/tmp/kafka-logs-2
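The per-broker settings follow a simple pattern: each extra broker needs a unique id, its own port, and its own log directory. A small Python sketch of that pattern is shown here; the helper names `broker_overrides` and `render_properties` are hypothetical, and deriving the port as `9092 + broker_id` is simply an assumption that reproduces the values on this slide.

```python
def broker_overrides(broker_id, base_port=9092, log_root="/tmp/kafka-logs"):
    """Return the three per-broker settings for one additional broker.

    Hypothetical helper; the port rule (base_port + broker_id) is an
    assumption chosen to match the example values 9093 and 9094.
    """
    return {
        "broker.id": str(broker_id),
        "port": str(base_port + broker_id),
        "log.dir": "%s-%d" % (log_root, broker_id),
    }


def render_properties(settings):
    """Format a settings dict as key=value lines for a .properties file."""
    return "\n".join("%s=%s" % (key, value) for key, value in settings.items())


server_1 = broker_overrides(1)
server_2 = broker_overrides(2)
text = render_properties(server_1)
```

The rendered text would still have to be merged into the copied `server-N.properties` files by hand or by a script; the sketch only computes the values that must differ between brokers on one machine.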
Start with New Nodes
Start the other nodes with the new configs
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
Create a new topic with a replication factor of 3
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Describe the new topic
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs:
Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
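In the describe output, Leader is the broker currently serving the partition, Replicas lists the brokers holding a copy, and Isr is the subset of replicas that are in sync with the leader. As a small sketch of how that line could be checked from a script, the following Python parses the per-partition line shown on this slide. The function name `parse_partition_line` is hypothetical, and the regular expression assumes exactly the "Name: value" layout printed here.

```python
import re


def parse_partition_line(line):
    """Parse one per-partition line of the describe output.

    Hypothetical helper; assumes the 'Name: value' layout from the slide.
    """
    fields = dict(re.findall(r"(\w+):\s*(\S+)", line))
    return {
        "topic": fields["Topic"],
        "partition": int(fields["Partition"]),
        "leader": int(fields["Leader"]),
        "replicas": [int(b) for b in fields["Replicas"].split(",")],
        "isr": [int(b) for b in fields["Isr"].split(",")],
    }


line = "Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0"
info = parse_partition_line(line)

# The partition is fully replicated when every replica is also in sync.
fully_replicated = sorted(info["replicas"]) == sorted(info["isr"])
```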
Spark Streaming
Makes it easy to build scalable, fault-tolerant streaming applications.
Ease of Use
Fault Tolerance
Combine streaming with batch and interactive queries.
Spark Streaming Programming Model
Spark Streaming provides a high-level abstraction called Discretized Stream, or DStream
- represents a stream of data
- implemented as a sequence of RDDs
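The idea of a stream "implemented as a sequence of RDDs" can be pictured as cutting incoming records into fixed-length time slices, each of which is then processed as one small batch. The following plain-Python sketch only models that slicing; it does not use Spark, and the function name `discretize` and the sample records are made up for illustration.

```python
from collections import defaultdict


def discretize(records, batch_interval):
    """Group (timestamp_seconds, value) records into fixed-length batches.

    Hypothetical helper: each returned list plays the role of one RDD
    in the sequence that makes up a DStream.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[i] for i in sorted(batches)]


records = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
batches = discretize(records, batch_interval=1)
```

With a batch interval of 1 second, the records arriving at 0.2 s and 0.9 s end up in the first batch, and the later records each fall into their own batch.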
Spark Streaming + Kafka
There are two approaches to receiving data from Kafka in Spark Streaming
• Receiver based approach
• Direct approach
# Import SparkContext, StreamingContext, and KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 1)  # 1-second batch interval

# Create a Kafka stream from the ZooKeeper address, a consumer group id,
# and a {topic: number of partitions to consume} map
kvs = KafkaUtils.createStream(ssc, "localhost:2181",
                              "spark-streaming-consumer", {"sparkStream": 1})

# lines DStream: keep only the value of each (key, value) message
lines = kvs.map(lambda x: x[1])

# counts DStream: word counts for each batch of lines
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
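To see what the `flatMap` / `map` / `reduceByKey` chain computes for a single batch, here is a Spark-free sketch in plain Python. The function name `word_count` is hypothetical, and the sample lines are the two console-producer messages used earlier in this deck.

```python
from collections import Counter


def word_count(batch):
    """Count words in one batch of lines, mirroring flatMap/map/reduceByKey."""
    # flatMap: split every line and flatten the results into one word list
    words = [word for line in batch for word in line.split(" ")]
    # map to (word, 1) followed by reduceByKey(a + b) amounts to counting
    return dict(Counter(words))


batch = ["This is a message", "This is another message"]
counts = word_count(batch)
```

In the streaming job the same computation runs once per batch interval, so `counts.pprint()` shows the counts for the lines received in that interval only, not a running total.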
from pyspark.streaming.kafka import KafkaUtils

# ssc, topic, and brokers must be defined beforehand (not shown on this slide)
directKafkaStream = KafkaUtils.createDirectStream(
    ssc, [topic], {"metadata.broker.list": brokers})

offsetRanges = []

def storeOffsetRanges(rdd):
    # Capture the Kafka offset ranges of this batch before any other transformation
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

directKafkaStream \
    .transform(storeOffsetRanges) \
    .foreachRDD(printOffsetRanges)
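Each offset range describes the slice of one Kafka partition that went into a batch, from `fromOffset` (inclusive) to `untilOffset` (exclusive). As a Spark-free sketch of what can be derived from those four fields, the following Python uses a hypothetical `ToyOffsetRange` named tuple as a stand-in for the real objects, with made-up offsets.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the same four fields that are printed above.
ToyOffsetRange = namedtuple(
    "ToyOffsetRange", "topic partition fromOffset untilOffset")


def messages_per_partition(offset_ranges):
    """Number of messages each partition contributed to one batch."""
    return {
        (o.topic, o.partition): o.untilOffset - o.fromOffset
        for o in offset_ranges
    }


ranges = [
    ToyOffsetRange("test", 0, 100, 130),
    ToyOffsetRange("test", 1, 40, 45),
]
sizes = messages_per_partition(ranges)
```

Recording these ranges alongside the results of a batch is one way an application could track how far it has read in each partition.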
Thank You

Real time data processing with Kafka Spark integration

  • 1. Real Time Data Streaming
    © Copyright 2014 EMC Corporation. All rights reserved.
    Speakers:
    Sumit Gupta, Data Intelligence Engineer, EMC
    Kartikeya Putturaya, Data Intelligence Engineer, EMC
    ChandraSekarRao Venkata, Data Intelligence Engineer, EMC
  • 2. Data Engineering at EMC IT
    Stack
    Distributed Frameworks: Apache Spark, Pivotal Hadoop, Apache Storm
    Messaging Systems: RabbitMQ, Apache Kafka
    Relational Store: Greenplum
    A glimpse of what we do
    Predictive Maintenance of Exchange Servers - Monitoring over 145 Exchange servers in real time, with an analytics engine running on an 8-node cluster, processing data volumes of ~100MB every 2 minutes
    User Behavior Analytics for Network Threat Detection - Real-time monitoring of EMC's internal networks and user behavior pattern analysis for threats, again on an 8-node cluster, processing a stream of ~150MB of data at any point in time
  • 3. Predictive Maintenance of Exchange Servers
  • 4. User Behavior Analytics for Network Threat Detection
  • 5. Apache Kafka
  • 6. Overview
    An Apache project initially developed at LinkedIn
    Distributed publish-subscribe messaging system
    • Designed for processing real-time activity stream data, e.g. logs and metrics collections
    • Written in Scala
    • Does not follow the JMS standard and does not use JMS APIs
    Features
    Persistent messaging
    High throughput
    Supports both queue and topic semantics
    Uses ZooKeeper for forming a cluster of nodes (producer/consumer/broker)
    and many more…
    http://kafka.apache.org/
  • 7. How it works
  • 8. Real time transfer
    The broker does not push messages to the consumer; the consumer polls messages from the broker.
  • 9. Kafka maintains feeds of messages in categories called topics. For each topic, the Kafka cluster maintains a partitioned log.
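The pull model and the partitioned log described in slides 8 and 9 can be illustrated with a toy in-memory model. This is only a sketch, not Kafka's implementation: `ToyTopic`, its methods, and the CRC32-based key hashing are assumptions made for illustration, and a real broker persists its logs to disk and replicates them.

```python
from zlib import crc32


class ToyTopic:
    """In-memory stand-in for one topic: a list of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Hash the key to pick a partition, so records with equal keys
        # land in the same log and keep their relative order.
        # (CRC32 is an arbitrary choice for this sketch.)
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1  # (partition, offset)

    def poll(self, partition, offset, max_records=10):
        # The consumer asks for records starting at its own offset;
        # the topic never pushes anything to it.
        return self.partitions[partition][offset:offset + max_records]


topic = ToyTopic(num_partitions=2)
partition, first_offset = topic.append("server-1", "cpu=40")
topic.append("server-1", "cpu=55")

# The consumer tracks its own position and advances it after each poll.
offset = first_offset
batch = topic.poll(partition, offset)
offset += len(batch)
print(batch, offset)
```

Because the consumer owns its offset, it can re-read old records simply by polling from an earlier position, which is the behaviour the `--from-beginning` console flag relies on.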
  • 10. Kafka Installation
    Download: http://kafka.apache.org/downloads.html
    Untar it:
    > tar -xzf kafka_<version>.tgz
    > cd kafka_<version>
  • 11. Start Servers
    Start the ZooKeeper server:
    > bin/zookeeper-server-start.sh config/zookeeper.properties
    Pre-requisite: ZooKeeper should be up and running. Now start the Kafka server:
    > bin/kafka-server-start.sh config/server.properties
  • 12. Create/List Topics
    Create a topic:
    > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
    List all topics:
    > bin/kafka-topics.sh --list --zookeeper localhost:2181
    Output:
    test
  • 13. Producer
    Send some messages:
    > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
    Now type on the console:
    This is a message
    This is another message
  • 14. Consumer
    Receive some messages:
    > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
    This is a message
    This is another message
  • 15. Multi-Broker Cluster
    Copy configs:
    > cp config/server.properties config/server-1.properties
    > cp config/server.properties config/server-2.properties
    Changes in the config files:
    config/server-1.properties:
        broker.id=1
        port=9093
        log.dir=/tmp/kafka-logs-1
    config/server-2.properties:
        broker.id=2
        port=9094
        log.dir=/tmp/kafka-logs-2
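The per-broker edits in slide 15 follow a simple pattern (broker.id=N, port=9092+N, log.dir with an -N suffix), so the override files could be generated rather than edited by hand. The following is a hedged sketch: the helper names `broker_overrides` and `apply_overrides` are hypothetical, and the base port 9092 and log root /tmp/kafka-logs are assumptions inferred from the values on the slides.

```python
def broker_overrides(broker_id, base_port=9092, log_root="/tmp/kafka-logs"):
    """Return the three properties that must differ for each broker."""
    return {
        "broker.id": str(broker_id),
        "port": str(base_port + broker_id),
        "log.dir": "%s-%d" % (log_root, broker_id),
    }


def apply_overrides(base_text, overrides):
    """Rewrite matching key=value lines and append any keys that were missing."""
    remaining = dict(overrides)
    out = []
    for line in base_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_property = "=" in line and not line.lstrip().startswith("#")
        if is_property and key in remaining:
            out.append("%s=%s" % (key, remaining.pop(key)))
        else:
            out.append(line)  # comments and unrelated settings pass through
    out.extend("%s=%s" % (k, v) for k, v in remaining.items())
    return "\n".join(out) + "\n"


# A shortened stand-in for config/server.properties (assumed content).
base = "# sample\nbroker.id=0\nport=9092\nlog.dir=/tmp/kafka-logs\nnum.partitions=1\n"
server_1 = apply_overrides(base, broker_overrides(1))
print(server_1)
```

Writing `server_1` to config/server-1.properties would then replace the manual copy-and-edit step for that broker.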
  • 16. Start with New Nodes
    Start the other nodes with the new configs:
    > bin/kafka-server-start.sh config/server-1.properties &
    > bin/kafka-server-start.sh config/server-2.properties &
    Create a new topic with a replication factor of 3:
    > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
    Describe the new topic:
    > bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
    Topic:my-replicated-topic PartitionCount:1 ReplicationFactor:3 Configs:
        Topic: my-replicated-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
  • 17. Spark Streaming
    Makes it easy to build scalable, fault-tolerant streaming applications.
    Ease of use
    Fault tolerance
    Combine streaming with batch and interactive queries.
  • 20. Spark Streaming Programming Model
    Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream, which
    - represents a stream of data
    - is implemented as a sequence of RDDs
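To make the DStream idea concrete, the sketch below groups timestamped events into fixed-length batches, with a plain Python list standing in for each RDD. It is a toy model under stated assumptions: `discretize` is a hypothetical helper, not a Spark API, and real Spark Streaming forms its batches by arrival time on a cluster rather than from a timestamp field.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches.

    Each batch covers batch_interval seconds; intervals with no events
    still yield an (empty) batch, so the result is an unbroken sequence.
    """
    buckets = {}
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        buckets.setdefault(index, []).append(value)
    if not buckets:
        return []
    return [buckets.get(i, []) for i in range(min(buckets), max(buckets) + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1)
for number, batch in enumerate(batches):
    print(number, batch)
```

The `1` passed to `StreamingContext(sc, 1)` in the PySpark word count plays the role of `batch_interval` here: it sets how many seconds of data end up in each RDD.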
  • 22. Spark Streaming + Kafka
    There are two approaches to receiving data from Kafka in Spark Streaming:
    • Receiver-based approach
    • Direct approach
  • 24.
    # Receiver-based approach: import StreamingContext and KafkaUtils
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval

    # Create a Kafka stream by passing the ZooKeeper address, the consumer
    # group and a {topic: number of partitions to consume} map
    kvs = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"sparkStream": 1})

    # lines DStream from the Kafka stream (each record is a (key, message) pair)
    lines = kvs.map(lambda x: x[1])

    # counts DStream from the lines DStream
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
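What the flatMap / map / reduceByKey chain computes for a single batch can be checked without a cluster. The following plain-Python sketch mirrors the three steps on a list of lines; `count_words` is a hypothetical helper, and unlike Spark it runs on one machine and on one batch only.

```python
def count_words(lines):
    # flatMap: split every line into words and flatten the result
    words = [word for line in lines for word in line.split(" ")]
    # map: pair each word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts of all pairs that share the same word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


# The two messages typed into the console producer earlier in the deck.
batch = ["This is a message", "This is another message"]
word_counts = count_words(batch)
print(word_counts)
```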
  • 26.
    from pyspark.streaming.kafka import KafkaUtils

    # Direct approach: ssc is the StreamingContext created in the word count
    # example; topic and brokers (e.g. "localhost:9092") must be defined by the caller
    directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

    offsetRanges = []

    def storeOffsetRanges(rdd):
        # Remember the Kafka offset ranges that make up this batch
        global offsetRanges
        offsetRanges = rdd.offsetRanges()
        return rdd

    def printOffsetRanges(rdd):
        for o in offsetRanges:
            print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

    directKafkaStream \
        .transform(storeOffsetRanges) \
        .foreachRDD(printOffsetRanges)
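In the direct approach each batch corresponds to a range of offsets per Kafka partition, which is what `offsetRanges()` exposes. The bookkeeping can be sketched in plain Python as follows. `ToyOffsetRange`, `plan_batch` and `commit` are hypothetical names, and the half-open convention (fromOffset inclusive, untilOffset exclusive) is modeled on Spark's `OffsetRange`; this is an illustration, not how Spark itself is implemented.

```python
from collections import namedtuple

ToyOffsetRange = namedtuple("ToyOffsetRange", "topic partition fromOffset untilOffset")


def plan_batch(topic, consumed, latest):
    """Return one offset range per partition that has unread records.

    consumed: partition -> next offset to read (missing means 0)
    latest:   partition -> current end of that partition's log
    """
    ranges = []
    for partition in sorted(latest):
        start = consumed.get(partition, 0)
        end = latest[partition]
        if end > start:
            ranges.append(ToyOffsetRange(topic, partition, start, end))
    return ranges


def commit(consumed, ranges):
    """Advance the consumed positions once a batch has been processed."""
    updated = dict(consumed)
    for r in ranges:
        updated[r.partition] = r.untilOffset
    return updated


consumed = {0: 5}
ranges = plan_batch("test", consumed, {0: 8, 1: 3})
for r in ranges:
    print("%s %s %s %s" % (r.topic, r.partition, r.fromOffset, r.untilOffset))
consumed = commit(consumed, ranges)
```

Keeping the ranges explicit is what lets an application store them alongside its results and decide for itself when a batch counts as processed.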
  • 27. Thank You